CEAN Language Extension and Programming Model

CEAN is an array language extension to C/C++ that provides array section notation for SIMD vector parallelism and parallel function maps for multi-threading. This section defines the syntax and semantics of the new constructs and illustrates their usage with examples.

Array Section Notation

CEAN introduces a section operator into the standard C/C++ language, as follows:

    section_operator ::= [<lower bound> : <length> : <stride>]

where <lower bound>, <length>, and <stride> are of integer types and represent the following set of integer values:

    <lower bound>, <lower bound> + <stride>, …, <lower bound> + (<length> - 1) * <stride>

A section operator can occur in place of a subscript operator. The following example is an array section with len elements, namely a[lb], a[lb + str], a[lb + 2*str], …, a[lb + (len-1)*str]:

    a[lb:len:str]

Length is chosen instead of upper bound because, in declarations, C/C++ deals with lengths of arrays and not upper bounds. Using a length also makes it easier to ensure that array sections match in size.

Successive section operators designate a sub-array of a multidimensional array object. When absent, <stride> defaults to 1. If <length> is less than 1, the array section is undefined. You can also use [:] as a shorthand for a whole array dimension if the size of the dimension is known from the array declaration. If either <lower bound> or <length> must be specified, you must specify both.

Example

    a[0:3][0:4]  // refers to 12 elements of the two-dimensional array a, starting at row 0, column 0, and ending at row 2, column 3
    b[0:2:3]     // refers to elements 0 and 3 of the one-dimensional array b
    b[:]         // refers to the entire array b

Array Declarations for Array Notations

For the array section notation to be useful, the compiler must know the shape and size of an array object. The table below summarizes the different ways of declaring arrays and pointers to array objects in C/C++.
Length          Storage Class   Declaration
Fixed           Static          static int a[16][128]
Fixed           Auto            void foo(void) { int a[16][128]; }
Fixed           Parameter       void bar(int a[16][128]);
Fixed           Heap            int (*p2d)[128];
Variable (C99)  Auto            void foo(int m, int n) { int a[m][n]; }
Variable (C99)  Parameter       void bar(int m, int n, int a[m][n]);
Variable (C99)  Heap            void bar(int m, int n) { int (*p2d)[n]; }

The variable length array (VLA) notation is a C99 [ISO/IEC 9899] extension. It is supported by the GNU GCC and Intel® compilers.
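As an illustration of the Variable/Heap form of declaration, the sketch below allocates a rows-by-cols matrix on the heap and views it through a pointer to a variable length row, so that p2d[i][j] indexing works as it would for a declared two-dimensional array. It uses only standard C99 and no CEAN syntax; the function name fill_and_sum and its parameters are hypothetical and chosen for illustration only.

```c
#include <stdlib.h>

/* Hypothetical helper, for illustration only.
 * Allocates a rows x cols matrix on the heap, stores i*cols + j in each
 * element, and returns the sum of all elements (or -1 if allocation fails). */
long fill_and_sum(int rows, int cols)
{
    /* Pointer to a row of `cols` ints: the C99 variable length, heap form. */
    int (*p2d)[cols] = malloc(sizeof(int[rows][cols]));
    if (p2d == NULL)
        return -1;

    long sum = 0;
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            p2d[i][j] = i * cols + j;   /* row length comes from the pointer type */
            sum += p2d[i][j];
        }
    }

    free(p2d);
    return sum;
}
```

Because the row length is part of the pointer type, only the number of rows is unknown to the compiler, which is why a section over such a pointer needs an explicit length in its first dimension.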

Note

You must use the -std=c99 (Linux* and Mac OS* X) or /Qstd=c99 (Windows*) compiler option for the compiler to accept the C99 extensions.

If the base of an array section has incompletely specified dimensions (such as a pointer variable), you must explicitly specify the length of the array section. This is illustrated in the following example.

Example

    typedef int (*p2d)[128];
    p2d p = (p2d) malloc(sizeof(int)*rows*128);
    p[:][:]       // error
    p[0:rows][:]  // ok

Operator Maps

Most C/C++ operators are available for array sections: +, -, *, /, %, <, ==, >, <=, !=, >=, ++, --, |, &, ^, &&, ||, !, - (unary), + (unary), +=, -=, *=, /=, * (pointer dereferencing).

Operators are implicitly mapped to all elements of the array section operands. The operations on different elements can be executed in parallel without any ordering constraints. Array operands in an operation must have the same rank and size. Rank is defined as the number of array section operators, and size is the length of each array section. A scalar operand is automatically expanded to fill the whole array section of any rank.

Example

    a[:] * b[:]                // element-wise multiplication
    a[3:2][3:2] + b[5:2][5:2]  // matrix addition of the 2x2 matrices in a and b starting at a[3][3] and b[5][5]
    a[0:4][1:2] + b[1:2][0:4]  // error, different rank sizes
    a[0:4][1:2] + b[0][1]      // ok, adds a scalar b[0][1] to an array section

Assignment Maps

The assignment operator applies in parallel to every element of the array section on the left hand side (LHS).

Example

    a[:][:] = b[:][2][:] + c;
    e[:] = d;
    e[:] = b[:][1][:];  // error, different rank
    a[:][:] = e[:];     // error, different rank

The right hand side (RHS) of an assignment is evaluated before any element on the left hand side is stored. The compiler introduces the temporary arrays needed to preserve these semantics.
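The evaluate-then-store rule can be emulated in standard C with an explicit temporary. The sketch below is plain C99, not CEAN syntax, and the helper names shift_add_one and shift_add_one_naive are hypothetical. It mimics an overlapping update of the form a[1:s] = a[0:s] + 1, where a naive in-place loop would read elements it had already overwritten.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: emulates a[1:s] = a[0:s] + 1 with array-assignment
 * semantics. The whole right-hand side is computed into a temporary before
 * any element of the left-hand side is written.
 * `a` must hold at least s + 1 elements. Returns 0 on success, -1 on
 * allocation failure. */
int shift_add_one(int *a, int s)
{
    if (s < 1)
        return 0;                       /* nothing to do for an empty section */

    int *tmp = malloc((size_t)s * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    for (int i = 0; i < s; i++)
        tmp[i] = a[i] + 1;              /* evaluate the RHS from old values */

    memcpy(&a[1], tmp, (size_t)s * sizeof *tmp);  /* then store the LHS */
    free(tmp);
    return 0;
}

/* For contrast: an ascending in-place loop reads values that earlier
 * iterations have already overwritten, so it gives a different result. */
void shift_add_one_naive(int *a, int s)
{
    for (int i = 0; i < s; i++)
        a[i + 1] = a[i] + 1;
}
```

Starting from {10, 20, 30, 40, 50} with s = 4, the first function yields {10, 11, 21, 31, 41}, while the naive loop propagates the new values and yields {10, 11, 12, 13, 14}.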
Example

    a[1:s] = a[0:s] + 1;  // use old value of a[1:s-1]

Because the RHS is evaluated before any assignment to the LHS, the compiler can vectorize the RHS computation even if some operand on the RHS may alias with the l-value on the LHS.

Gather and Scatter

When an array section occurs directly under a subscript expression, it designates a set of elements indexed by the values of the array section.

Example

    unsigned index[10];
    float out[10], in[10];
    out[0:5] = in[index[0:5]];  // gather
    out[index[5:5]] = in[0:5];  // scatter

If the index values in a scatter array section overlap with each other, the values stored to the duplicated locations must be the same; otherwise, the final stored value after the scatter is undefined. On target architectures with hardware scatter/gather support, the compiler maps the CEAN operations to the appropriate hardware instructions.

Reductions

A reduction combines array section elements to generate a scalar result. CEAN supports reductions on array sections. It defines a generic reduction function that applies a user-defined dyadic function. It also has nine built-in common reduction functions. The built-in functions are polymorphic functions that accept int, float, and other C basic data type arguments. The names and descriptions of the reduction functions are summarized in the table below.

Reduction Function Prototypes

Function Prototype                   Description
__sec_reduce(fun, identity, a[:])    Generic reduction function. Reduces fun across the array a[:] using identity as the initial value.
__sec_reduce_add(a[:])               Built-in reduction function. Adds values passed as arrays.
__sec_reduce_mul(a[:])               Built-in reduction function. Multiplies values passed as arrays.
__sec_reduce_all_zero(a[:])          Built-in reduction function. Tests that array elements are all zero.
__sec_reduce_all_nonzero(a[:])       Built-in reduction function. Tests that array elements are all non-zero.
__sec_reduce_any_nonzero(a[:])       Built-in reduction function. Tests for any array element that is non-zero.
__sec_reduce_min(a[:])               Built-in reduction function. Determines the minimum value of array elements.
__sec_reduce_max(a[:])               Built-in reduction function. Determines the maximum value of array elements.
__sec_reduce_min_ind(a[:])           Built-in reduction function. Determines the index of the minimum value of array elements.
__sec_reduce_max_ind(a[:])           Built-in reduction function. Determines the index of the maximum value of array elements.

The reduction operation in CEAN can reduce on multiple ranks. The number of ranks reduced depends on the execution context. For a given execution context of rank m and a reduction array section argument with rank n, where n > m, the last n-m ranks of the array section argument are reduced.

Example

    sum = __sec_reduce_add(a[:][:]);               // sum across the whole array a
    sum_of_column[:] = __sec_reduce_add(a[:][:]);  // sum across the columns of a (reduces the last rank)

Function Maps

Maps are implicitly defined on scalar functions. All the array section arguments in a scalar function map call must have the same rank. Scalar arguments are automatically filled to match any rank.

Example

    a[:] = sin(b[:]);
    a[:] = pow(b[:], c);        // b[:]**c
    a[:] = pow(c, b[:]);        // c**b[:]
    a[:] = foo(b[:]);           // user defined function
    a[:] = bar(b[:], c[:][:]);  // error, different ranks

Elemental function calls are executed in parallel for all the elements, with no specific ordering. Elemental functions may have side effects. When there are conflicts during parallel execution, such as setting an error code, you are responsible for correctness.

Function maps are powerful tools used to apply a set of operations in parallel to all elements of an array section. The compiler takes advantage of function maps to generate multi-threaded parallel calls. Many vendors supply vectorized libraries for commonly used math functions. The compiler automatically replaces the scalar math function with the corresponding SIMD library calls in addition to multi-threading.
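The rank rule for reductions can be emulated with ordinary loops. The following plain C99 sketch (not CEAN syntax; all function names are hypothetical) covers the two cases from the reduction example: a rank-0 context reduces both ranks of a two-dimensional array, while a rank-1 context reduces only the last rank and leaves one partial result per index of the first rank. A generic fold in the style of __sec_reduce, taking a dyadic function and an identity value, is included for comparison.

```c
/* Rank-0 context, rank-2 argument: both ranks are reduced to one scalar. */
float reduce_add_all(int rows, int cols, float a[rows][cols])
{
    float sum = 0.0f;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            sum += a[i][j];
    return sum;
}

/* Rank-1 context, rank-2 argument: only the last rank is reduced,
 * so out[i] receives the sum of a[i][0..cols-1]. */
void reduce_add_last_rank(int rows, int cols, float a[rows][cols],
                          float out[rows])
{
    for (int i = 0; i < rows; i++) {
        out[i] = 0.0f;
        for (int j = 0; j < cols; j++)
            out[i] += a[i][j];
    }
}

/* Generic reduction over a one-dimensional array: folds the dyadic
 * function `fun` across a[0..n-1], starting from `identity`. */
float reduce_generic(float (*fun)(float, float), float identity,
                     int n, const float a[n])
{
    float acc = identity;
    for (int i = 0; i < n; i++)
        acc = fun(acc, a[i]);
    return acc;
}

/* Example dyadic function for use with reduce_generic. */
float max_of(float x, float y)
{
    return x > y ? x : y;
}
```

These loops fix an evaluation order, whereas an array-notation reduction leaves the order to the compiler, so floating-point results may differ slightly between the two.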
Passing Array Section Arguments

CEAN supports a vector kernel style of programming, where vector code is encapsulated within a function by declaring array parameters of fixed or parameterized vector lengths. The address of the first element of an array section can be passed as an argument to an array parameter. The following example illustrates how to combine CEAN vectorization inside a function body with OpenMP* threading for parallel function calls.

Example

    void saxpy_vec(int m, float a, float x[m], float y[m]) {
        y[:] += a * x[:];
    }

    int main(void) {
        float a[2048], b[2048];
        #pragma omp parallel for
        for (int i = 0; i < 2048; i += 256) {
            saxpy_vec(256, 2.0, &a[i], &b[i]);
        }
        return 0;
    }

By writing the function explicitly with array arguments, you can write portable vector code using any threading runtime and scheduler.

Limitations

There are two limitations on the usage of array sections:

• Array sections cannot occur inside the conditional test of an if statement.
• Functions cannot return array section values.

More often than not, you can convert an if statement into a C/C++ conditional select operation in order to use array sections, as shown in the following example.

Example

    for (int i = 0; i < n; i++) {
        if (a[i] > b[i]) {
            c[i] = a[i] - b[i];
        } else {
            d[i] = b[i] - a[i];
        }
    }

    // can be rewritten as
    c[0:n] = (a[0:n] > b[0:n]) ? a[0:n] - b[0:n] : c[0:n];
    d[0:n] = (a[0:n] <= b[0:n]) ? b[0:n] - a[0:n] : d[0:n];

You can also pass a pointer to a result array as a function argument to achieve the same effect as returning an array value from a function.

Programming Hints and Examples

CEAN allows you to write an application with mixed loop and array operations. The compiler uses the array operations in the program to guide vectorization. The following example implements an FIR filter. The scalar code consists of a doubly nested loop where both the inner and outer loops can be vectorized. By writing the program in different ways using array notation, you can direct the compiler to vectorize differently.
Example: FIR Scalar Code

    for (i=0; i

Example

    #include <stdlib.h>
    #define VLEN 4
    #define NROWS 4

    //------
    // Vector kernel
    // for each grid
    // o[x][y] = (i[x-1][y-1] + i[x-1][y] + i[x-1][y+1] +
    //            i[x][y-1]   + i[x][y]   + i[x][y+1] +
    //            i[x+1][y-1] + i[x+1][y] + i[x+1][y+1])/9;
    // written with:
    // 1) VLEN columns for vectorization
    // 2) NROWS rows for the reuse of the adjacent row loads
    //------
    void nine_point_average(int h, int w, int i, int j,
                            float in[h][w], float out[h][w])
    {
        float m[NROWS][VLEN];
        m[:][:]  = in[i:NROWS][j:VLEN];
        m[:][:] += in[i+1:NROWS][j:VLEN];
        m[:][:] += in[i+2:NROWS][j:VLEN];
        m[:][:] += in[i:NROWS][j+1:VLEN];
        m[:][:] += in[i+1:NROWS][j+1:VLEN];
        m[:][:] += in[i+2:NROWS][j+1:VLEN];
        m[:][:] += in[i:NROWS][j+2:VLEN];
        m[:][:] += in[i+1:NROWS][j+2:VLEN];
        m[:][:] += in[i+2:NROWS][j+2:VLEN];
        out[i:NROWS][j:VLEN] = 0.1111f * m[:][:];
    }

    //------
    // caller
    //------
    const int width = 512;
    const int height = 512;
    typedef float (*p2d)[];

    int main() {
        p2d src = (p2d) malloc(width*height*sizeof(float));
        p2d dst = (p2d) malloc(width*height*sizeof(float));

        // …
        // perform average over 9 points
        #pragma omp parallel for
        for (int i = 0; i < height - NROWS - 3; i += NROWS) {
            for (int j = 0; j < width - VLEN - 3; j += VLEN) {
                nine_point_average(height, width, i, j, src, dst);
            }
        }
    }
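For checking results, a scalar reference of the same tile computation can be written in standard C. The sketch below is an assumption about how one might validate the vector kernel and is not part of the original example; the name nine_point_average_ref and the tile parameters nrows and vlen are hypothetical. It follows the indexing of the kernel above: the output element at (i + r, j + c) receives 0.1111f times the sum of the 3x3 block of inputs whose top-left corner is at that same position.

```c
/* Hypothetical scalar reference for one tile of the nine-point average.
 * Writes an nrows x vlen tile of `out` starting at (i, j). The caller must
 * ensure that i + nrows + 2 <= h and j + vlen + 2 <= w, because each output
 * element reads a 3x3 block of `in`. */
void nine_point_average_ref(int h, int w, int i, int j, int nrows, int vlen,
                            float in[h][w], float out[h][w])
{
    for (int r = 0; r < nrows; r++) {
        for (int c = 0; c < vlen; c++) {
            float m = 0.0f;
            /* Accumulate the 3x3 block with top-left corner (i + r, j + c). */
            for (int dr = 0; dr < 3; dr++)
                for (int dc = 0; dc < 3; dc++)
                    m += in[i + r + dr][j + c + dc];
            out[i + r][j + c] = 0.1111f * m;
        }
    }
}
```

The reference accumulates the nine inputs in a different order than the vector kernel, so a comparison between the two should allow for a small floating-point tolerance rather than require exact equality.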