2010 International Conference on Complex, Intelligent and Software Intensive Systems

A Multidimensional Array Slicing DSL for Stream Programming

Pablo de Oliveira Castro¹, Stéphane Louise¹ and Denis Barthou²
¹ CEA LIST, Embedded Real Time Systems Laboratory, Point Courrier 94, Gif-sur-Yvette, F-91191 France
² University of Bordeaux - Labri / INRIA, 351, cours de la Libération, Talence, F-33405 France
{pablo.de-oliveira-castro, stephane.louise}@cea.fr [email protected]


Abstract—Stream languages offer a simple multi-core programming model and achieve good performance. Yet expressing data rearrangement patterns (like a block decomposition) in these languages is verbose and error prone. In this paper, we propose a high-level language to elegantly describe n-dimensional data reorganization patterns. We show how to compile it to stream languages.

1 INTRODUCTION

Stream programming languages [1][2][3] are particularly well-suited to write efficient parallel programs for multi-core architectures. Fork-join parallelism and pipelines are explicitly described by the stream graph, and task memory requirements and communication costs may be statically extracted from the stream representation, enabling powerful optimization strategies [4][5] for high performance.

Languages such as StreamIt[1] or ΣC[2] are examples of stream languages with optimizing compilers. These compilers analyze the stream communication patterns and simplify them, breaking useless dependencies. These optimizations rely in particular on the fact that all communication patterns use Split, Duplicate and Join nodes. While very expressive, this low-level representation of dataflow reorganizations is very verbose and error prone.

In this paper we propose a high-level language for the description of stream reorganizations. In this language, streams are structured through iterators, enabling the construction of complex patterns of communication/reorganization. We show that these iterators and patterns can then be compiled efficiently into stream graphs using Split, Duplicate and Join nodes. The language can be seen as an extension to stream languages; as such, we show how it can be integrated with the StreamIt language (but it could easily be adapted to other stream languages). We have implemented a compiler for this language that produces stream graphs.

1.1 Stream languages

Stream languages model parallel programs with stream graphs. In this dataflow representation, nodes represent either data reorganization operations between streams or filters, and arcs are communications between nodes. Each time a node is fired it will consume a fixed number of elements on its inputs and produce a fixed number of elements on its outputs.

Filters are particular nodes that have only one input and one output; they represent computation nodes, possibly keeping a state through successive firings. Split, Duplicate and Join nodes are nodes that dispatch data through the application. Since the focus of this paper is on data reorganization, we will concentrate on Split, Duplicate and Join nodes. We recall below the main types of data reorganization nodes:

Join round-robin J(c1 ... cn): a join round-robin has n inputs and one output. We associate to each input i a consumption rate ci ∈ N*. The node fires periodically. In its k-th firing the node takes cu, where u = (k mod n)+1, elements on its u-th input and writes them on its output. As in a classic Cyclo Static Data-Flow [6] model, nodes only fire when there are enough elements on their input.

Split round-robin S(p1 ... pm): a split round-robin has m outputs and one input. We associate to each output j a production rate pj ∈ N*. In its k-th firing the node takes pv, where v = (k mod m)+1, elements on its input, and writes them to the v-th output.

Duplicate D(m) has one input and m outputs. Each time this node is fired, it takes one element on the input and writes it to every output, duplicating its input m times.

Besides, stream graphs have sources and sinks:

Source I(l): a source models a program input. It has an associated size l. The source node is fired only once and writes l elements to its single output. If all the elements in the source are the same, the source is constant and denoted by the node C(l).

Sink O: a sink models a program output, consuming all elements on its single input. If we never observe the consumed elements, we say the sink is trash and we write the node T.

1.2 Motivating Example

As a motivating example we present an excerpt from a matrix multiplication program that is shipped with the StreamIt distribution 2.1.1 (cf. figure 1).

In StreamIt the stream graph is described hierarchically, in a textual form:
• add, is used to chain subgraphs.
• split duplicate, splits the previous output through a Duplicate node.

978-0-7695-3967-6/10 $26.00 © 2010 IEEE 913 DOI 10.1109/CISIS.2010.135

• split roundrobin, splits the previous output through a Split round-robin node.
• join roundrobin, joins the previous outputs with a Join round-robin node.

As we can observe in figure 1, describing a reorganization of 2D data in StreamIt is quite tedious.

    float->float pipeline MatrixMultiply(int x0, int y0, int x1, int y1) {
      add RearrangeDuplicateBoth(x0, y0, x1, y1);
      add MultiplyAccParallel(x0, x0);
    }
    float->float splitjoin RearrangeDuplicateBoth(int x0, int y0, int x1, int y1) {
      split roundrobin(x0 * y0, x1 * y1);
      // the first matrix just needs to get duplicated
      add DuplicateRows(x1, x0);
      // the second matrix needs to be transposed first
      // and then duplicated
      add RearrangeDuplicate(x0, y0, x1, y1);
      join roundrobin;
    }
    float->float pipeline RearrangeDuplicate(int x0, int y0, int x1, int y1) {
      add Transpose(x1, y1);
      add DuplicateRows(y0, x1*y1);
    }
    float->float splitjoin Transpose(int x, int y) {
      split roundrobin;
      for (int i = 0; i < x; i++) add Identity();
      join roundrobin(y);
    }
    float->float pipeline DuplicateRows(int times, int length) {
      split duplicate;
      for (int i = 0; i < times; i++) add Identity();
      join roundrobin(length);
    }

Fig. 1. StreamIt program for matrix multiplication

2 HIGH-LEVEL LANGUAGE

We propose a high-level language that describes data reorganization operations on data streams, through the manipulation of shapes and slicing patterns. The language is built around five concepts: Shapes, Grids, Blocks, Iterators and Zipping, described in this section.

2.1 Shapes

The language restructures input streams into multidimensional patterns with shape types. These shapes correspond to a multidimensional indexing of the stream elements.

In the following example, the two input streams, identified by the numbers 0 and 1 and accessed using the keyword “input”, are structured into 3 shapes:

    shape[10] A = input 0
    shape[15,10] B = input 1
    shape[3,3,3] C = input 0

Stream 0 is viewed in A as a stream of vectors of length 10 and in C as a stream of 3 × 3 × 3 cubes; stream 1 is viewed in B as a stream of 15 × 10 matrices.

More generally, given a view shape [s1, . . . , sd], the view coordinates (x1, . . . , xd) of the first pattern correspond to the linearized stream positions

    $\sum_{i=1}^{d} x_i \prod_{j=1}^{i-1} s_j$

For later patterns, we must take into account the size of the previous patterns.

2.2 Grids

On instances of type shape we can apply the grid operator, which is defined by giving on each dimension i three parameters (li, hi, δi):
• li is the lower bound of the grid for dimension i.
• hi is the upper bound of the grid for dimension i.
• δi is the stride of the grid for dimension i.

For each dimension i, we consider the set of points:

    $G_i = \{\delta_i \cdot k \cdot \vec{e}_i : \forall k \in [|\, \frac{l_i}{\delta_i} ; \frac{h_i}{\delta_i} \,|]\}$

The elements of a grid are constructed by computing the Cartesian product of the $G_i$:

    $G = G_1 \otimes \cdots \otimes G_d$

They are lexicographically ordered. This ordering defines a grid iterator G(n), where G(0) is the first element, G(1) the second, etc.

The grid operator uses a standard slicing notation where li, hi, δi are separated by colons and each dimension is separated by commas, [l1:h1:δ1, . . . , ld:hd:δd]. The points described by the grid B[2:15:5,0:8:3], for instance, are represented on figure 2(a). If the dimensions of a grid are not the same as the dimensions of the shape on which it is applied, a type error is raised. For simplicity, it is possible to omit one or more values of the triplet; missing values are replaced by sensible default values (0 in place of li, si in place of hi, 1 in place of δi). For instance, the above example could be written B[2::5,:8:3].

2.3 Blocks

The block operator can only be applied upon a grid type. A block is a d-dimensional box parametrized by its min and max coordinates on each dimension: (a1:b1, . . . , ad:bd) with ai, bi ∈ Z.

(−1:1, 0:1) defines a 3 × 2 block B. The points in B are lexicographically ordered, giving an ordered set:

    B = {(−1, 0), (0, 0), (1, 0), (−1, 1), (0, 1), (1, 1)}lex

Blocks must always be applied to a grid of the same dimension using the product (×) operator,

    B[2::5,:8:3] x (−1:1,0:1)

which describes the points in figure 2(b). If a block does not have the same dimension as the grid to which it is applied, a type error is raised.

To apply a block on a grid, we center the block around each point of the grid and take the resulting set of points. The resulting points, in order, are defined by the following iterator of ordered sets, where g = G(n) is the n-th point of the grid:

    GB(n) = {g + b : ∀b ∈ B}lex

Successive blocks may overlap. For example, B[::,0:1:] x (0:1,0:9) extracts successive blocks of column pairs from B (cf. fig. 2(c)).

When blocks fall partially or totally outside of the shape defined for the current stream, a configurable default value is returned for missing elements.

(a) B[2:15:5,0:8:3]   (b) B[2::5,:8:3] x (−1:1, 0:1)   (c) B[::,0:1:] x (0:1,0:9)

Fig. 2. Set of points described by (a) a grid[2], (b) a gridblock[2], (c) an overlapping gridblock[2]. The gradient of colors gives the iterator order (cool colors are first).
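The linearization rule of section 2.1 can be checked with a few lines of Python. This is a sketch for the first pattern of a stream only, and the function name linear_position is an assumption made for illustration.

```python
def linear_position(coords, shape):
    """Stream position of view coordinates (x1, ..., xd) in a shape [s1, ..., sd].

    Computes sum_i x_i * prod_{j<i} s_j, valid for the first pattern of the stream.
    """
    position, stride = 0, 1
    for x, s in zip(coords, shape):
        position += x * stride
        stride *= s  # the next dimension advances by the product of earlier sizes
    return position


if __name__ == "__main__":
    # shape[15,10]: x1 indexes inside a row of length 15, x2 selects the row.
    for coords in [(0, 0), (14, 0), (0, 1), (14, 9)]:
        print(coords, linear_position(coords, (15, 10)))
```

For a later pattern, the size of the previous patterns (the product of all s_i, times the pattern index) would have to be added to the result.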

2.4 Iterators

Shape, grid and gridblock are all instances of the iterator type. We combine instances of the iterator type to reorganize our data, using the “for”, “in” and “push” keywords. The “for in” construct iterates over the elements of a given iterator. The “push” keyword produces an element or an ordered set of elements on the output.

    shape[3] D = input 0
    shape[2] E = input 1
    for d in D:
      for e in E:
        push e
        push d

produces the elements

    {E(0), D(0), E(1), D(0), E(0), D(1), E(1), D(1), ... }

2.5 Zipping

We introduce the polymorphic zip operator that enables us to interleave two iterators, or two ordered sets of elements.

zip(A,B) interleaves the elements in the operands,

    $Z(n) = \begin{cases} A(\frac{n}{2}) & \text{if } n \equiv 0 \pmod{2} \\ B(\frac{n-1}{2}) & \text{if } n \equiv 1 \pmod{2} \end{cases}$

2.6 Type system

The operators defined previously have a strict type system, ensuring that only correct programs are accepted:

    grid, gridblock, shape ∈ iterator
    shape[s1, . . . , sd] : shape[d]
    [l1:h1:δ1, . . . , ld:hd:δd] : shape[d] → grid[d]
    (a1:b1, . . . , ad:bd) : grid[d] → gridblock[d]
    for.in : iterator → orderedset

push and zip are polymorphic operators which can both be used on orderedset or iterators. push returns an IO type, since it pushes the elements in its operand to the output channel.

3 COMPILATION

This section presents the compilation of the high-level language introduced in the previous section into a stream graph. First we show that we can extract any gridblock[1] using stream graphs. Then we compile gridblock[d] graphs by composing multiple gridblock[1] graphs. Finally we show how to handle “for in push” primitives.

3.1 Compilation of 1D gridblock

We observe that [l : h : δ] is equivalent to [l : h : δ] × (0 : 0); therefore compiling grid[1] instances is a special case of gridblock[1] compilation.

We separate the extraction of gridblock[1] ≡ [l : h : δ] × (a : b) in two steps:
• (cf. sec. 3.1.1), select the region [l′ : h′ : 1] where l′ = l − a is the coordinate of the first element required, and h′ = l′ + δ.((h − l) div. δ) + b the coordinate of the last element.
• (cf. sec. 3.1.2 to 3.1.4), inside this region, extract the blocks [:: δ] × (0 : w), where w = b − a + 1 is the width of the blocks.

Because a stream may contain an infinite sequence of patterns, it is important for the produced graphs to be reusable an infinite number of times. We have ensured that after a pattern is consumed there are no left-over elements in any of the edges. This steady state execution guarantees that the graph can be reused without side-effects.

3.1.1 Selecting a region

We want to extract the region [l′ : h′ :] from a shape[1] of length s. If the region is [0 : s :], we have nothing to do. In the other cases we must either cut some data (when the region is smaller than the shape) or inject some default values (when the region falls outside of the defined shape). These two cases can happen for both the upper and the lower bound; we detail the process for the lower bound:
• when l′ = 0, we do nothing;
• when l′ < 0, we inject −l′ default elements using a Join node and a constant Source;
• when l′ > 0, we cut the first l′ elements using a Split node and a trash Sink.

For l′ = −3, h′ = 8 and s = 10 we obtain the graph represented in figure 3.

Once the region is selected, a sequence of blocks can be extracted from it.
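To make the node semantics of section 1.1 and the region selection above concrete, here is a small Python model. It is only a sketch: streams are finite lists, the function names are hypothetical, and the upper bound h′ is treated as exclusive so that [0 : s :] denotes the whole shape (an assumption, since the text does not fix this point).

```python
def split_round_robin(stream, rates):
    """Split round-robin S(p1 ... pm): firing k sends p_v elements to output v."""
    outputs = [[] for _ in rates]
    pos, k = 0, 0
    while pos < len(stream):
        v = k % len(rates)
        outputs[v].extend(stream[pos:pos + rates[v]])
        pos += rates[v]
        k += 1
    return outputs


def join_round_robin(inputs, rates):
    """Join round-robin J(c1 ... cn): firing k takes c_u elements from input u."""
    cursors = [0] * len(inputs)
    out, k = [], 0
    while True:
        u = k % len(rates)
        chunk = inputs[u][cursors[u]:cursors[u] + rates[u]]
        if len(chunk) < rates[u]:
            return out  # not enough elements on this input: the node stops firing
        out.extend(chunk)
        cursors[u] += rates[u]
        k += 1


def select_region(pattern, lo, hi, default=0):
    """Select [lo : hi) from one pattern of length s, cutting or padding as needed."""
    s = len(pattern)
    data = pattern
    if lo > 0:   # cut the first lo elements: Split node + trash Sink
        _trash, data = split_round_robin(data, [lo, s - lo])
    if hi < s:   # cut the tail the same way
        keep = hi - max(lo, 0)
        data, _trash = split_round_robin(data, [keep, len(data) - keep])
    if lo < 0:   # inject -lo default elements: constant Source + Join node
        data = join_round_robin([[default] * -lo, data], [-lo, len(data)])
    if hi > s:   # pad the tail the same way
        data = join_round_robin([data, [default] * (hi - s)], [len(data), hi - s])
    return data


if __name__ == "__main__":
    # l' = -3, h' = 8, s = 10, with -1 standing in for the default value.
    print(select_region(list(range(10)), -3, 8, default=-1))
```

Under these assumptions the example pads three default values in front, keeps eight elements and trashes the last two.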

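Returning to the iterator semantics of sections 2.4 and 2.5, the sketch below mimics the “for in push” example and the zip operator with plain Python lists standing in for iterators. The function names and the string labels are hypothetical, and zip_iterators assumes both operands have the same length.

```python
def for_in_push(D, E):
    """Mirror of: for d in D: for e in E: push e; push d."""
    out = []
    for d in D:
        for e in E:
            out.append(e)  # push e
            out.append(d)  # push d
    return out


def zip_iterators(A, B):
    """Z(n) = A(n/2) for even n and B((n-1)/2) for odd n."""
    out = []
    for a, b in zip(A, B):
        out.extend([a, b])
    return out


if __name__ == "__main__":
    print(for_in_push(["D0", "D1", "D2"], ["E0", "E1"]))
    print(zip_iterators(["A0", "A1"], ["B0", "B1"]))
```

The first call reproduces the beginning of the sequence listed in section 2.4: each element of D is held steady while E is replayed in full.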
Fig. 3. Example of graph for grid region extraction

Fig. 5. Pipeline filling during complete overlap. Contents of the joins at each step:
    (a) J1: 0          J2: (empty)    J3: (empty)
    (b) J1: 0 1        J2: 1          J3: (empty)
    (c) J1: 0 1 2 3    J2: 1 2 3      J3: 2 3
    (d) J1: 0 1 2 3    J2: 1 2 3 4    J3: 2 3 4 5

Fig. 6. Complete overlap with missing blocks (marked with an M)

Fig. 7. Multidimensional region extraction
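The three block extraction cases of sections 3.1.2 to 3.1.4 must all produce the same result as taking n blocks of width w every δ elements. The Python sketch below models each case on a finite list and can be compared against that direct definition. All function names are hypothetical, the input is assumed to hold at least δ.(n − 1) + w elements, and the complete-overlap model is restricted to δ = 1 as in figure 5.

```python
def reference_blocks(stream, w, delta, n):
    """Direct definition: n blocks of width w taken every delta elements."""
    out = []
    for k in range(n):
        out.extend(stream[k * delta:k * delta + w])
    return out


def no_overlap(stream, w, delta, n):
    """w <= delta: emit the first block, then drop each gap and emit each block."""
    out = list(stream[:w])                                   # first edge of S1
    rest = stream[w:w + delta * (n - 1)]                     # second edge of S1
    for j in range(n - 1):
        chunk = rest[j * delta:(j + 1) * delta]
        _gap, block = chunk[:delta - w], chunk[delta - w:]   # S2: gap goes to T
        out.extend(block)
    return out


def partial_overlap(stream, w, delta, n):
    """delta < w < 2*delta: duplicate the w - delta elements shared by neighbours."""
    out = list(stream[:w - delta])                           # first edge of S1
    pos = w - delta
    for _ in range(n - 1):                                   # chunks of delta elements
        chunk = stream[pos:pos + delta]
        plain, shared = chunk[:2 * delta - w], chunk[2 * delta - w:]   # S2
        out.extend(plain)
        out.extend(shared)                                   # end of the current block
        out.extend(shared)                                   # D1: start of the next block
        pos += delta
    out.extend(stream[pos:pos + delta])                      # third edge of S1
    return out


def complete_overlap_unit_stride(stream, w, n):
    """delta = 1: fill w joins as in fig. 5(d), then read them column by column."""
    joins = [stream[r:r + n] for r in range(w)]              # row r holds r .. r+n-1
    return [joins[r][k] for k in range(n) for r in range(w)]


if __name__ == "__main__":
    data = list(range(20))
    print(no_overlap(data, 2, 3, 4))
    print(partial_overlap(data, 3, 2, 3))
    print(complete_overlap_unit_stride(list(range(6)), 3, 4))
```

The last call uses the six elements 0 to 5 of figure 5 with w = 3 and n = 4; handling of missing blocks when (w mod δ) ≠ 0 is not modelled here.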

Consider w, the width of the blocks, and δ, the stride of the grid. Depending on the ratio w/δ we can distinguish three situations:
• w/δ ≤ 1, no overlap, see section 3.1.2.
• 1 < w/δ < 2, partial overlap, see section 3.1.3.
• w/δ ≥ 2, complete overlap, see section 3.1.4.

3.1.2 Extracting no-overlapping blocks (fig. 4(a))

In this case, a sequence of n blocks of size w, separated by gaps of size δ − w, must be produced in the stream. With a first split (S1) we extract the first block of size w, then n − 1 blocks+gaps. The first block is produced on the output; then, for each block+gap, we produce the block and throw away the gap with (S2+T).

3.1.3 Extracting partial overlapping blocks (fig. 4(b))

In this case, n blocks of size w, with overlaps of size w − δ (except for the first and last blocks), have to be extracted. We produce w − δ elements at the start of the stream (first edge of S1). After that we extract (second edge of S1) a sequence of (n − 1) chunks of δ elements (outlined in purple on the figure). For each of these chunks, we separate (using S2) the overlapping (in striped green) and non-overlapping parts. Both are produced, but the overlapping part is duplicated first (using D1). Finally we produce the remaining δ elements for the last block (using the third edge of S1).

3.1.4 Extracting complete overlapping blocks (fig. 4(c))

In the complete overlap case, we must produce n blocks that are overlapped. This case corresponds to filling a pipeline. It is the most difficult of the patterns, because the number of nodes produced depends on the maximal number $m_{overlap}$ of overlapping elements. We show that $m_{overlap} = \min(\lceil w/\delta \rceil, n)$. This overlap is reached once the pipeline is full (green striped blocks), yet the pipeline must be filled and emptied (purple striped and pink blocks). We demonstrate how to achieve this when $m_{overlap} = 3$. The approach below can be generalized for any value of $m_{overlap}$.

We start with $m_{overlap}$ joins (here J1, J2, J3). We start filling the pipeline, putting the first element in J1, as in fig. 5(a). Then we duplicate the second element twice (with D1) and put it in J1 and J2, as in fig. 5(b). The pipeline is now at full regime: each element of the stream is replicated three times (with D2) and put in J1, J2 and J3, as in fig. 5(c). Finally, using D3 to duplicate 5, we fill the end of the pipeline, as in fig. 5(d). If we observe the pipeline matrix columns in fig. 5(d), we see that taking one element alternately from rows J1, J2 and J3 produces the desired blocks on the output.

When (w mod. δ) ≠ 0, we have missing blocks on the repetition pattern (cf. figure 6). To handle these missing blocks, we use a simple split and trash after the above pipeline pattern.

3.2 Compilation of Multidimensional Gridblock

Having shown how to generate any gridblock[1], we now generalize the approach to higher dimensions.

Multidimensional grids and blocks are by construction Cartesian products of their 1D counterparts. For instance the gridblock[2],

    [l1 : h1 : δ1, l2 : h2 : δ2] × (a1 : b1, a2 : b2)

can be decomposed into,

    ([l1 : h1 : δ1] × (a1 : b1)) ⊗ ([l2 : h2 : δ2] × (a2 : b2))

as shown in figure 7.

We use this compositional property to compile gridblock[d] graphs from a set of gridblock[1] graphs:
1) We decompose the gridblock[d] expression into its 1D components, ([li : hi : δi] × (ai : bi)), with 1 ≤ i ≤ d.
2) We define the folded size f for dimension dim as:

    $f(dim) = \prod_{i=1}^{dim-1} s_i$ with $f(1) = 1$

which is the number of elements in any hyperplane obtained by cutting along the dim dimension.
3) We compile, for every i, the graph $G_i$ which produces the elements defined by,

    [li.f(i) : hi.f(i) : δi.f(i)] × (ai.f(i) : bi.f(i))

4) The obtained graph $G_i$ extracts the 1D component for dimension i.

We chain the $G_i$ graphs to produce the final graph, $I \to G_d \to G_{d-1} \to \cdots \to G_1 \to O$, effectively generating the expected blocks.

The above process extracts the elements using the gridblock[1] extraction. We will not prove it here for lack of space. The general idea is that each $G_i$ is modified to consume elements of the folded size. Taking the example in figure 7, $G_2$ extracts ([l2 : h2 : δ2] × (a2 : b2)) (in the left margin of the figure), but considering elements of size s1 (the row length). This process produces the striped region. Then $G_1$ extracts ([l1 : h1 : δ1] × (a1 : b1)) (in the top margin) with elements of size 1.

Nevertheless, this process works with the lexicographic order of the shape, whereas we must return the blocks in the order of the grid. To achieve this we add at the end of our graph a reordering stage (which is built with a Split followed by a Join).

Fig. 4. The three possible scenarios of gridblock[1] extraction: (a) No overlap (w/δ ≤ 1), (b) Partial overlap (1 < w/δ < 2), (c) Complete overlap (w/δ ≥ 2)

3.3 Handling the “for in push” construct

Multiple streams can be combined by nesting different iterators. Our compiler analyzes the nesting level of each push and duplicates the elements of each stream accordingly, using a Duplicate and a Join node.

Let us consider the following example:

    for a in A:
      for b in B:
        for c in C:
          push b
          for d in D:
            push c

Given a “push x” statement, we define:
• base(x): the iterator that generates x is called the base iterator and noted base(x). For instance, in the example above the base iterator of “push b” is B.
• Outer(x): the set of iterators that contain base(x).
• Inner(x): the set of iterators that are contained by base(x) and that enclose the statement “push x”.

In the example above Outer(b) = {A} and Inner(b) = {C}, and similarly Outer(c) = {A, B} and Inner(c) = {D}.

We denote |X| the number of elements in iterator X. We compute the number of iterations of the outer loops, $O(x) = \prod_{I \in Outer(x)} |I|$, and similarly of the inner loops, $I(x) = \prod_{I \in Inner(x)} |I|$.

To satisfy the “for in” semantics we must replay the stream as many times as the number of iterations of the outer loops, and keep the current value steady for as many times as the number of iterations of the inner loops. Both of these operations are easily expressed with a Duplicate node followed by a Join node.

3.4 Putting it all together

Combining region extraction and grid block extraction, we can generate a stream graph for every program with a single push loop. As multiple pushes can occur inside a loop, the stream graphs corresponding to the pushes are concatenated using a single Join node that gathers their outputs. Finally each zip is implemented using a single node that interleaves its inputs.

3.5 Optimizations

We have optimized our implementation to reduce the number of data copies and the number of nodes necessary for a pattern extraction.

To avoid unnecessary copies in the gridblock[1] graphs, the duplicate nodes have been placed at the latest possible place, so that previous transformations are factorized (instead of duplicating a pattern and extracting a region twice, we extract the region once and duplicate it afterward).

To optimize gridblock[d] graphs we choose an optimal order in which to chain the gridblock[1] stages. In this order all the no-overlap extraction stages (cf. section 3.1.2) are at the top of the pipeline. This is interesting because those stages throw away elements, therefore reducing the number of elements that have to be handled downstream.

4 MATRIX MULTIPLICATION REVISITED

We define a new StreamIt keyword, datafilter, which helps mixing our high-level language with normal StreamIt code. A datafilter is like a filter, with the difference that it can have multiple inputs. It is instantiated with the join keyword, as in the example below:

    float->float pipeline MatrixMultiply(int x0, int y0, int x1, int y1) {
      join RearrangeDuplicateBoth(x0, y0, x1, y1);
      add MultiplyAccParallel(x0, x0);
    }
    (float, float)->float datafilter RearrangeDuplicateBoth(int x0, int y0, int x1, int y1) {
      shape[x0, y0] A = input 0
      shape[x1, y1] B = input 1
      for l in A[0:1:,::] x (0:x0,0:0):
        for c in B[::,0:1:] x (0:0,0:y1):
          push zip(l,c)
    }

The first input is seen as a stream of x0 × y0 matrices, the second as a stream of x1 × y1 matrices. A[0:1:,::] x (0:x0,0:0) iterates over the rows of A and B[::,0:1:] x (0:0,0:y1) over the columns of B. push zip(l,c) interleaves and yields (row, column) pairs.

The datafilter body is compiled using the method described in the previous section, producing the graph in figure 8, which is in fact simpler than the graph that would have been generated by the original StreamIt program in figure 1 (the identity filters have been removed). StreamIt enforces that the nodes of a stream graph are connected hierarchically, but this is not a requirement for the stream graphs produced in this paper.

Fig. 8. Compiled graph for the matrix multiplication program in sec. 4

We have not provided more examples for lack of space, but we have successfully used our proposed language to describe the data manipulations needed by a Sobel filter, a Gauss filter, a transposition, a FIR, etc.

5 RELATED WORKS

Dataflow programming models have been extensively studied [7], and many variants have been proposed. StreamIt[1] is both a stream language and an optimizing compiler for the RAW[4] and SMP[8] architectures. It is based on the cyclo-static dataflow model[6].

ZPL[9] is a parallel high-level language that allows the programmer to globally describe the repartition of data between the processors and the communications between them. To describe data slicings, ZPL introduces a region abstraction. Regions are first-class n-dimensional arrays that can be moved around the original data set using directions (stride vectors). The expressive power of ZPL descriptions in terms of regions is comparable to our proposed language. Because there are fewer constraints on which directions are applied to a region, ZPL allows more complex walks through the data. Yet ZPL is not compilable to stream graphs: the order in which the directions are applied depends on conditions which are not statically known until execution time.

Array-OL is a dataflow model and graphical programming language for signal processing [10]. The language is built around the concept of filters which are repeatedly applied. Each filter is defined with an iteration space, a pattern and a stride vector. The Array-OL model is similar to our model, but does not allow nested iterations over multiple streams.

The multidimensional shape type has been borrowed from the Single Assignment C (SAC) language [11], which proposes a functional C variant with multidimensional array operations. The main focus of this language is to optimize array manipulation by combining successive operations into a single one.

Finally the slice notations used for grids and blocks are those used in the Matlab[12] and Python[13] languages.

6 CONCLUSION

In this paper we present a novel language for describing multidimensional data reorganizations. This frees the programmer from having to write complicated graphs with Join, Duplicate and Split nodes to express data reorganization patterns, as they are described in our Domain Specific Language. We show how to compile this language to stream graphs and, as an example, how to integrate it with StreamIt. The generated graphs are optimized to avoid unnecessary data copies and are conservative in the number of nodes used.

REFERENCES

[1] W. Thies, M. Karczmarek, and S. Amarasinghe, “StreamIt: A Language for Streaming Applications,” in Proc. of the Intl. Conf. on Compiler Construction, 2002.
[2] T. Goubier, F. Blanc, S. Louise, R. Sirdey, and V. David, “Définition du Langage de Programmation ΣC, RT CEA LIST DTSI/SARC/08-466/TG,” Tech. Rep., 2008.
[3] C. Aussaguès, E. Ohayon, K. Brifault, and Q. Dinh, “Using Multi-Core Architectures to Execute High Performance-Oriented Real-Time Applications,” To appear in Proc. of Int. Conf. on Parallel Computing, 2009.
[4] M. Gordon, W. Thies, and S. Amarasinghe, “Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs,” in Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2006.
[5] S.-w. Liao, Z. Du, G. Wu, and G.-Y. Lueh, “Data and Computation Transformations for Brook Streaming Applications on Multiprocessors,” in Proc. of the Intl. Symp. on Code Generation and Optimization, 2006.
[6] G. Bilsen, M. Engels, R. Lauwereins, and J. Peperstraete, “Cyclo-static Dataflow,” in IEEE Trans. on Signal Processing, 1996.
[7] E. Lee and T. Parks, “Dataflow Process Networks,” in Proc. of the IEEE, 1995.
[8] J. Sermulins, W. Thies, R. Rabbah, and S. Amarasinghe, “Cache Aware Optimization of Stream Programs,” in Proc. of the ACM conf. on Languages, Compilers, and Tools for Embedded Systems, 2005.
[9] S. J. Deitz, B. L. Chamberlain, and L. Snyder, “Abstractions for dynamic data distribution,” in Proc. of the Workshop on High-Level Parallel Programming Models and Supportive Environments, 2004.
[10] A. Amar, P. Boulet, and P. Dumont, “Projection of the Array-OL Specification Language onto the Kahn Process Network Computation Model,” in Intl. Symp. on Parallel Architectures, Algorithms and Networks, 2005.
[11] S.-B. Scholz, “Single Assignment C: Efficient Support for High-Level Array Operations in a Functional Setting,” J. Funct. Program., 2003.
[12] MATLAB, Language Reference Manual v5, The MathWorks, Inc., 1996.
[13] G. van Rossum, Python Reference Manual, CWI Report, 1995.
EFERENCES ial h lc oain sdfrgisadbok are blocks and grids for used notations slice the Finally borrowed been has type shape multidimensional The program- graphical and model dataflow a is Array-OL G a Rossum, van G. High- for Support Efficient C: Assignment “Single Scholz, S.-B. Array-OL the of “Projection Dumont, P. and Boulet, P. Amar, A. ALB agaeRfrneMna v5 in Manual Setting,” 1996. Reference Functional Language a MATLAB, in Operations Program. Array Level Compu- Networks Network and in Process Model,” Kahn the tation onto Language Specification Environments Supportive and Models Programming Parallel in distribution,” data Systems dynamic Embedded for Tools and in Compilers, Programs,” Languages, Stream on conf. of Optimization Aware IEEE in Dataflow,” static on Applications Optimization Streaming Brook in Multiprocessors,” for Pro- Transformations Stream putation in Systems Parallelism Operating and Pipeline Languages and in Data, grams,” Task, Grained in appear Real- Computing To Performance-Oriented High Applications,” Execute Time to Architectures Core 2008. Rep., Tech. DTSI/SARC/08-466/TG,” “D Construction in Compiler Applications,” on Streaming for Language fiiind agg eProgrammation de Langage du efinition ONCLUSION ´ 1995. , 2003. , 2009. , s .Oao,K rfut n .Dn,“sn Multi- “Using Dinh, Q. and Brifault, K. Ohayon, E. es, ` 2006. , nl of nAcietrlSpotfrProgramming for Support Architectural on Conf. Intl. 2005. , yhnRfrneManual Reference Python nl yp nPrle rhtcue,Algorithms Architectures, Parallel on Symp. Intl. EETas nSga Processing Signal on Trans. IEEE rc fteIt.Sm.o oeGnrto and Generation Code on Symp. Intl. the of Proc. 2002. , rc fteWrso nHigh-Level on Workshop the of Proc. 2006. , rc fIt of nParallel on Conf. Int. of Proc. W eot 1995. Report, CWI , h ahok,Inc., MathWorks, The , rc fteIt.Conf. Intl. the of Proc. Σ ,R E LIST CEA RT C, rc fteACM the of Proc. 1996. , rc fthe of Proc. 2004. , .Funct. J. 2005. ,