Comparability Graph Coloring for Optimizing Utilization of Stream Register Files in Stream Processors

Xuejun Yang∗, Li Wang∗†¶, Jingling Xue†, Yu Deng∗, Ying Zhang∗

∗ National Laboratory for Parallel and Distributed Processing, School of Computer, NUDT, China
{xjyang, yudeng, zhangying}@nudt.edu.cn

† Programming Languages and Compilers Group, School of Computer Science and Engineering, UNSW, Australia
{lwang, jingling}@cse.unsw.edu.au

Abstract

A stream processor executes an application that has been decomposed into a sequence of kernels that operate on streams of data elements. During the execution of a kernel, all streams accessed must be communicated through the SRF (Stream Register File), a non-bypassing software-managed on-chip memory. Therefore, optimizing utilization of the SRF is crucial for good performance. The key insight is that the interference graphs (IGs) formed by the streams in stream applications tend to be comparability graphs or decomposable into a set of multiple comparability graphs. We present a compiler algorithm that can find optimal or near-optimal colorings in stream IGs, thereby improving SRF utilization over the First-Fit bin-packing algorithm, the best in the literature.

Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors - compilers and optimization; B.3.2 [Memory Structures]: Design Styles - Primary memory

General Terms Algorithms, Languages, Performance

Keywords Stream processor, stream programming, comparability graph coloring, software-managed cache

¶ The work was carried out during Li Wang’s visit to Jingling Xue’s Research Group at UNSW during February 2008 – February 2009.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PPoPP’09, February 14–18, 2009, Raleigh, North Carolina, USA.
Copyright © 2009 ACM 978-1-60558-397-6/09/02 ... $5.00

1. Introduction

Media applications, such as image processing, signal processing, video and graphics, are becoming an increasingly dominant portion of computing workloads today. In contrast with other applications, media applications exhibit producer-consumer locality with little global data reuse, have abundant parallelism and require high computation rates (with 10-100 billion operations per second and a few to thousands of operations per input data). These characteristics are poorly matched to conventional general-purpose programmable architectures that depend on data reuse (captured by hardware-managed caches), cannot exploit the available parallelism and cannot support high computation rates. On the other hand, special-purpose media-processing processors tailored to one specific application require significant design effort and are thus difficult to change as applications and/or algorithms evolve.

The (programmable) stream processors, such as Imagine [18], Raw [20], Cell [23], AMD FireStream and Merrimac [4], represent a promising alternative in achieving high performance in media applications [17, 18, 21]. In addition, stream processing is also well suited for some scientific applications [4, 23, 25].

Figure 1: Block diagram of the FT64 stream processor.

We have recently designed and fabricated a 64-bit stream processor, FT64 [25], for media applications as well as certain scientific applications that are also amenable to stream processing. As shown in Figure 1, like Imagine [18], Cell [23] and Merrimac [4], FT64 can be easily mapped to the stream virtual machine architecture described in [12]. Such a stream processor executes applications that have been mapped to the stream programming model: a program is decomposed into a sequence of computation-intensive kernels that operate on streams of data elements. Kernels are compiled to VLIW microprograms to be executed on clusters of ALUs, one at a time. Streams are stored in the SRF (Stream Register File), a software-managed on-chip memory.

Expressing an application as streams exposes its inherent locality and parallelism. Kernels expose kernel locality by keeping temporary values local (in the local register files near the ALUs, which are not shown in Figure 1) and instruction-level parallelism (exploited by the multiple ALUs in each cluster). Streams expose producer-consumer locality between kernels, enabling some output streams produced by a kernel to be consumed by the next kernel in sequence, and data-level parallelism, enabling different elements of an input stream to be operated on simultaneously, one on each ALU cluster, in a SIMD fashion.

Research into advanced compiler technology for stream languages and architectures is still in its infancy. Among several challenges posed by stream processing for compilation [5], an efficient allocation of the scarce on-chip SRF is critical to performance. SRF, the nexus of a stream processor, is introduced to capture the widespread producer-consumer locality in media applications to reduce expensive off-chip memory traffic. Unlike conventional register files, however, SRF is non-bypassing, namely, the input and output streams of a kernel must be all stored in the SRF when a […] operate on one strip at a time. Alternatively, some streams can be double-buffered [5] or spilled [22] until the data set of every kernel […]

Figure 2 (code listing, partially recoverable). Stream program:

    complex xmat[2*N], ymat[2*N];
    complex twiddlemat[log2(2*N)*N];
    stream a(N), b(N);
    stream twiddle(N);
    stream c(N), d(N);
    dataInit('xmatix.dat', xmat);
    dataInit('twiddlematrix.dat', twiddlemat);
    Load(xmat[0, N-1], a);
    Load(xmat[N, 2*N-1], b);
    for (int i = 0; i < log2(2*N); i+=2) {
      […]
    }
    Store(a, ymat[0, N-1]);
    Store(b, ymat[N, 2*N-1]);

Kernel:

    fft(stream a, stream b,
        stream twiddle,
        stream c, stream d)
    {
      complex a_tmp, b_tmp, c_tmp, d_tmp;
      complex twiddle_tmp;
      for (int i = 0; i < N/2; i++) {
        a>>a_tmp;
        b>>b_tmp;
        twiddle>>twiddle_tmp;
        c<[…]
        […]>>a_tmp;
        b>>b_tmp;
        twiddle>>twiddle_tmp;
        d<[…]

taken by the streams. Such an algorithm can then be used by a stream compiler to produce a final SRF allocation by combining with live range splitting and strip mining, if necessary.

A stream program consists of a sequence of loops where each loop includes a sequence of kernels operating on streams. In a stream compiler, all loops are considered separately in SRF allocation. As shown in Figure 1, the DRAM controller supports two stream-level instructions, Load and Store, that transfer an entire stream between off-chip memory and the SRF. In stream programs as demonstrated in Figure 2, loads and stores are used to initialize some streams from the global input data residing in off-chip memory and write certain streams to off-chip memory, respectively.

The central machinery in our approach to allocating the streams in a loop to the SRF is the traditional interference graph (IG) except that it is a weighted (undirected) graph formed by the streams operated on by the kernels in the loop. All streams accessed in the loop are identified as live ranges to be placed in the SRF. If two live ranges interfere (i.e., overlap), they must be placed in non-overlapping SRF spaces. The live ranges of streams are computed by extending the def/use definitions for scalars to streams: Load defines a stream, Store uses a stream, and a kernel call (re)defines its output streams and uses its input streams. The live range of a stream starts from its definition and ends at its last use. Of course, streams are renamed using the SSA (static single assignment) form.

After the live ranges have been computed for a loop, its weighted (undirected) IG, denoted G, is built in the normal manner, where a weighted node denotes a stream live range whose weight is the size of the stream and an edge connects two nodes if their live ranges interfere with each other.

The SRF allocation problem can be naturally solved as an interval-coloring problem as formalized below. Allocating SRF spaces to stream live ranges in an IG is represented by an assignment of intervals to the nodes in the IG. Minimizing the span of intervals amounts to minimizing the required SRF size.

DEFINITION 1. Given a stream IG G = (V, E) with positively integral node weights w : V → ℕ (representing stream sizes), an interval coloring α of G maps each node x onto an interval α_x of the real line of width w(x) such that adjacent nodes are mapped to disjoint intervals, i.e., (x, y) ∈ E implies α_x ∩ α_y = ∅.

It is well-known that interval coloring is NP-complete.

Our IG-based approach is flexible enough to accommodate pre-pass optimizations that are applied earlier to a program either by the programmer or the compiler. One example is to reorder some loads and stores to overlap memory transfers and kernel execution. Another is to split some long live ranges (in scientific applications) accomplished by inserting a pair of store and load instructions. We plan to automate their integration with this work in future.

4. Comparability Graph Coloring for Stream IGs

Section 4.1 recalls the basic results about interval coloring and comparability coloring [9], which provide a basis for understanding our approach and proving its optimality and near-optimality. Section 4.2 describes our key insight drawn from a careful analysis of the structure of stream IGs: a large number of stream IGs are comparability graphs, enabling their optimal colorings to be found in polynomial time. In Section 4.3, we turn this insight into an algorithm that can find optimal or near-optimal colorings for well-structured media and scientific applications when their stream IGs are expected to be decomposable into a set of comparability graphs.

4.1 Interval and Comparability Graph Coloring: Basics

Given a (directed or undirected) graph G = (V, E) and a subset A ⊆ V, the induced subgraph by A is G(A) = (A, E(A)), where E(A) = {(x, y) ∈ E | x, y ∈ A}. A subset A ⊆ V of r nodes is an r-clique if it induces a complete subgraph. A clique is a maximal clique if it is not contained in any other clique.

Given an undirected graph G = (V, E) with the function w mapping nodes to positively integral weights, the total width (i.e., the number of hues) of an interval coloring α, χ_α(G; w), is |∪_{x∈V} α_x|. The chromatic number χ(G; w) is the smallest width needed to color the nodes in G. The clique number is defined as ω(G; w) = max{w(K) | K is a clique of G}. As a fact, we have:

    χ(G; w) ≥ ω(G; w)    (1)

4.1.1 Interval Coloring vs. Acyclic Orientation

Figure 3 illustrates the equivalence between finding an interval coloring and finding an acyclic orientation for a weighted graph.

Figure 3: Two interval colorings α and β of a weighted undirected graph together with their equivalent acyclic orientations. Panels: (a) (G; w), with node weights x:2, a:3, b:4, c:2, y:1, z:4; (b) χ_α(G; w) = 12; (c) χ_β(G; w) = 10.

Let G = (V, E) be an undirected graph. An orientation of G is a function α that assigns every edge a direction such that α(x, y) ∈ {(x, y), (y, x)} for all (x, y) ∈ E. Let G_α be the digraph obtained by replacing each edge (x, y) ∈ E with the arc α(x, y). An orientation α is acyclic if G_α contains no directed cycles.

Every interval coloring α of G induces an acyclic orientation α′ such that (x, y) ∈ α′ if and only if α_x > α_y, i.e., an arc is directed from x to y if and only if α_x is to the right of α_y for all (x, y) ∈ E.

Conversely, an acyclic orientation α of G induces an interval coloring α′. For a sink node x, let α′_x = [0, w(x)). Proceeding inductively, for a node y with all its successor nodes already being colored (i.e., α′ defined at the successors), let t be the largest endpoint of their intervals and define α′_y = [t, t + w(y)).

From an acyclic orientation, we can obtain an interval coloring in linear time by a depth-first search.

The problem of finding optimal colorings is NP-complete. In an optimal coloring, the chromatic number χ(G; w) is related to the notion of heaviest path in an acyclic orientation of G:

    χ(G; w) = min_{α ∈ A(G)} ( max_{μ ∈ P(α)} w(μ) )    (2)

where A(G) is the set of acyclic orientations of G and P(α) the set of directed paths in an orientation α ∈ A(G). In other words, the orientation whose heaviest path is the smallest induces an optimal coloring. The heaviest-path-based formulation stated in (2) is exploited in the development of our coloring algorithm for stream IGs (Section 4.3).

In Figure 3(b), the heaviest path is x → c → b → z with a total weight of χ_α(G; w) = 12. In Figure 3(c), the heaviest path is c → z → b with a total weight of χ_β(G; w) = 10. The gap between the two colorings is 2 but can be larger (Figure 22). So there is a need to look for an optimal solution efficiently in practice.
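The construction in Section 4.1.1, which turns an acyclic orientation into an interval coloring, can be illustrated with a minimal Python sketch. The function name `interval_coloring`, the dictionary-based graph encoding and the three-node example graph are assumptions made for illustration only; they are not taken from the paper, and the example is not the graph of Figure 3.

```python
def interval_coloring(weights, arcs):
    """Map an acyclic orientation to the interval coloring it induces.

    weights: dict node -> positive integer weight (stream size).
    arcs:    iterable of (u, v) pairs, meaning the edge is directed u -> v,
             i.e. u's interval lies to the right of v's.
    Returns (intervals, width) with intervals[node] = (start, end), half-open.
    """
    succ = {v: [] for v in weights}
    for u, v in arcs:
        succ[u].append(v)

    intervals = {}

    def place(node, on_stack):
        # Depth-first search: color all successors first, then this node.
        if node in intervals:
            return intervals[node][1]
        if node in on_stack:
            raise ValueError("orientation contains a directed cycle")
        on_stack.add(node)
        # A sink starts at 0; otherwise start at the largest successor endpoint.
        start = max((place(s, on_stack) for s in succ[node]), default=0)
        on_stack.discard(node)
        intervals[node] = (start, start + weights[node])
        return start + weights[node]

    width = max(place(v, set()) for v in weights)
    return intervals, width


# Hypothetical path graph a - b - c with weights 3, 4, 2.
w = {"a": 3, "b": 4, "c": 2}
_, chain_width = interval_coloring(w, [("a", "b"), ("b", "c")])    # width 9
_, better_width = interval_coloring(w, [("a", "b"), ("c", "b")])   # width 7
```

In this toy graph the second orientation makes b a sink, so its heaviest path weighs 7, which equals the heaviest clique {a, b}; the first orientation stacks all three intervals and needs width 9. This mirrors the role of equation (2): the width of the induced coloring is the weight of the heaviest directed path.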

Figure 4: An illustration of Definition 4 (n = 3). The graphs shown are G0, G1, G2, G3 and G0[G1, G2, G3].

Due to the equivalence between acyclic orientations and interval colorings, we also write χ_α(G; w) to mean the width of the interval coloring associated with an acyclic orientation α of G.

4.1.2 Comparability Graph Coloring

In the context of this work, we examine below a class of graphs that allows interval colorings to be found optimally in polynomial time.

DEFINITION 2. An orientation α of an undirected graph G is transitive if (x, z) ∈ G_α whenever (x, y), (y, z) ∈ G_α.

DEFINITION 3. An undirected graph G is a comparability graph if there exists a transitive orientation of G.

A transitive orientation is acyclic, but an acyclic orientation is not necessarily transitive. In Figure 3, α is not transitive since (x, b), (b, a) ∈ G_α but (x, a) ∉ G_α. However, β is transitive. Therefore, the graph shown in Figure 3(a) is a comparability graph.

Let α be a transitive orientation of a comparability graph G. Restating (1), we have χ(G; w) ≥ ω(G; w). Due to transitivity, every path in G_α is contained in a clique of G. In particular, the heaviest path in G_α equals the heaviest clique in G, i.e., χ(G; w) ≤ χ_α(G; w) = ω(G; w). Hence, χ_α(G; w) = χ(G; w) = ω(G; w). This result is summarized below.

THEOREM 1. For any transitive orientation α of G, the interval coloring induced is optimal (and can be found in linear time).

DEFINITION 4. Let G0 be a graph with n nodes v1, v2, ..., vn and G1, G2, ..., Gn be n disjoint graphs. These graphs may be directed or undirected. The composition graph G = G0[G1, G2, ..., Gn], which is illustrated in Figure 4, is formed formally as follows. First, replace vi in G0 with Gi. Second, for all 1 ≤ i, j ≤ n, make each node of Gi adjacent to each node of Gj whenever vi is adjacent to vj in G0. Formally, for Gi = (Vi, Ei), we define G = (V, E) as:

    V = ∪_{1≤i≤n} Vi
    E = ∪_{1≤i≤n} Ei ∪ {(x, y) | x ∈ Vi, y ∈ Vj and (vi, vj) ∈ E0}

THEOREM 2. Let G = G0[G1, G2, ..., Gn], where all Gi’s are disjoint undirected graphs. Then G is a comparability graph if and only if each Gi (0 ≤ i ≤ n) is a comparability graph.

Furthermore, the problems of recognizing a comparability graph G = (V, E) and finding a transitive orientation of G can both be done in O(δ·|E|) time and O(|V| + |E|) space, where δ is the maximum degree of a node in G. Based on such an orientation α, an optimal coloring of G can be obtained in linear time (Theorem 1).

4.2 Optimal Colorings of Comparability Stream IGs

In stream programs with producer-consumer locality but little global data reuse, the live ranges of streams are also local. A typical stream program (or a loop in such a program) consists of a series of kernels, each producing intermediate streams to be consumed by the next kernel in sequence. We show below that if all stream live ranges in a stream IG do not span across more than two kernel calls, then the IG is a comparability graph and its optimal coloring can thus be found in polynomial time. This result is proved easily by a straightforward application of Theorem 2.

Figure 5 shows the IG for a series of three kernels, where all live ranges are no longer than two kernel calls. In particular, stream q is live from kernel ‘1’ to kernel ‘2’, streams u, v and w are live in kernels ‘2’ and ‘3’, and the remaining streams are only live at the kernels where they are operated on. In this example and the proofs of our results, whether a stream is an input or output is irrelevant.

    Load(..., p);
    Kernel('1', p, q);
    Load(..., r);
    Load(..., s);
    Load(..., t);
    Kernel('2', q, r, s, t, u, v, w);
    Load(..., x);
    Kernel('3', u, v, w, x, y);
    Store(y, ...);

Figure 5: A stream program and its IG.

Let G_cg be the IG built from a loop containing N_cg kernels (numbered from 1) such that each live range in G_cg is not longer than two kernels. We partition all live ranges in G_cg into 2N_cg sets:

    K_1, K_12, K_2, K_23, K_3, ..., K_{(N_cg−1)N_cg}, K_{N_cg}, K_{N_cg 1}    (3)

where K_i consists of all streams accessed, i.e., live only in kernel i and K_{i(i⊕1)} all streams live only in kernels i and i⊕1. We define i⊕c to be (i+c−1) % N_cg + 1 and i⊖c to be (i−c−1) % N_cg + 1.

As illustrated in Figure 6, all streams accessed in a kernel in a loop form a maximal clique in the stream IG of the loop.

LEMMA 1. The streams in K_{(i⊖1)i} ∪ K_i ∪ K_{i(i⊕1)} form a maximal clique for every kernel i.

Figure 6: Kernel-induced cliques for the program in Figure 5.

Our main results are stated in two theorems: Theorem 3 is applicable when N_cg is even and Theorem 4 is applicable when K_{N_cg 1} = ∅, i.e., when cross-iteration reuse is absent. When neither condition holds, we can apply loop unrolling once to produce a loop with an even number of kernels so that Theorem 3 can be applied. For stream processors, unrolling a stream program that is executed on the host processor does not negatively affect program performance. (Code size expansion for the host is not a concern.)

THEOREM 3. If N_cg is even, then G_cg is a comparability graph.

PROOF. Let us assume first that all sets listed in (3) are not empty. By construction, the live ranges in every such set are equal. Thus, the induced subgraph of G_cg by K_i (K_{i(i⊕1)}) is a clique, denoted G_i (G_{i(i⊕1)}). So we have the following 2N_cg induced cliques:

    G_1, G_12, G_2, G_23, G_3, ..., G_{(N_cg−1)N_cg}, G_{N_cg}, G_{N_cg 1}    (4)

In addition, for any two sets K and K′ listed in (3), either every live range x ∈ K interferes with every live range x′ ∈ K′ or there is no interference between the live ranges in K and those in K′.

By Theorem 2, in G_cg, if we let G_i (G_{i(i⊕1)}) “collapse” into one node, identified by K_i (K_{i(i⊕1)}), and denote the resulting

“decomposed graph” by G0, we have:

    G_cg = G0[G_1, G_12, G_2, ..., G_{N_cg}, G_{N_cg 1}]

A clique is a comparability graph. Thus, G_i, i = 1, 12, 2, ..., N_cg 1, given in (4) are all comparability graphs. Then, by Theorem 2, G_cg is a comparability graph if we show that G0 is. To achieve this, by Definition 3, it suffices if we can find a transitive orientation of G0. As shown in Figure 7, there are exactly two different transitive orientations since K_12, K_23, ..., K_{N_cg 1} must alternate to be a source or a sink (Lemma 2). This is possible since N_cg is even.

Figure 7: Two transitive orientations of G0 (N_cg = 4). Panels: (a) G0; (b) Two orientations.

Finally, if any set listed in (3) is empty, the decomposed graph G0 is still a comparability graph since every induced subgraph of a comparability graph is a comparability graph.

THEOREM 4. If K_{N_cg 1} = ∅, then G_cg is a comparability graph.

PROOF. A transitive orientation of G_cg as shown in Figure 7 always exists even if N_cg is odd since the “ring” is broken at K_{N_cg 1}.

In fact, Theorem 4 holds as long as K_{i(i⊕1)} = ∅ for some i.

Let us illustrate Theorem 4 in Figure 8 for the IG in Figure 5. Being a comparability graph, its optimal coloring is guaranteed. The optimality is independent of the node weights in the graph.

Figure 8: Optimal interval coloring of the stream IG given in Figure 5 (with the weights of p, q and r being 1, the weights of s, t, u and v being 2 and the weights of w, x and y being 3). To avoid cluttering, in the graph labelled by G0[G_1, G_12, G_2, G_23, G_3], a thicker arrow directing from a clique K to a clique K′ symbolizes all directed edges (x, y), for all x ∈ K and all y ∈ K′.

The facts stated in Lemmas 2 and 3 are exploited in the development of our algorithm for coloring stream IGs in Section 4.3.

LEMMA 2. Suppose G_cg is a comparability graph. Let G′_cg be an induced subgraph of G_cg. If G′_cg is connected, then it has at most eight different transitive orientations.

PROOF. If G′_cg is connected, there must exist two kernels i and j such that K_{i(i⊕1)}, K_{(i⊕1)(i⊕2)}, ..., K_{(j⊖2)(j⊖1)}, K_{(j⊖1)j} are in G′_cg and that these are the only sets containing two-kernel long live ranges listed in (3). Let us consider only the worst case when K_i = K_j = ∅. In a transitive orientation of G′_cg, the middle j − i − 2 sets in the above list must alternate to serve as a source or a sink. So there are only two possibilities. In either case, edge (K_{i(i⊕1)}, K_{i⊕1}) may have at most two orientations, and similarly, edge (K_{j⊖1}, K_{(j⊖1)j}) may have at most two orientations. So there are at most 2 × 2 × 2 = 8 different transitive orientations.

LEMMA 3. Suppose G_cg is a comparability graph. If all sets in (3) are nonempty, then G_cg has exactly two transitive orientations.

PROOF. G_cg is connected and then apply Lemma 2 (Figure 7).

4.3 A General Algorithm for Coloring Stream IGs

In some scientific applications (amenable to stream processing), the presence of temporal reuse in a few streams could make their live ranges longer than two kernels. In some media applications, there are also occasionally a few long producer-consumer live ranges. Furthermore, some live ranges may be extended by the programmer or a pre-pass compiler optimization in order to overlap memory transfers and kernel execution. Such stream IGs may or may not be comparability graphs. In this section, we generalize our work described in the preceding section to deal with these stream IGs.

The basic idea is to partition the node set V in G = (V, E) into:

    V_s = {v ∈ V | v’s live range spans at most two kernels}
    V_l = {v ∈ V | v’s live range spans more than two kernels}

As a result, E is partitioned into the following three subsets:

    E_s  = {(x, y) ∈ E | x ∈ V_s, y ∈ V_s}
    E_l  = {(x, y) ∈ E | x ∈ V_l, y ∈ V_l}
    E_sl = {(x, y) ∈ E | x ∈ V_s, y ∈ V_l}

By Theorems 3 and 4, the subgraph G(V_s) induced by V_s is a comparability graph. Our key observation is that the long live ranges in stream IGs are sparse and tend not to be live simultaneously. So we assume that the subgraph G(V_l) is a forest of disjoint trees (which are trivially comparability graphs). In the rare cases when G(V_l) is not a forest, we may, as part of future work, apply live range splitting to shorten some live ranges to make it so.

    Load(..., k);
    Kernel('1', k, l);
    Load(..., m);
    Load(..., n);
    Kernel('2', l, m, n, o);
    Load(..., p);
    Load(..., q);
    Kernel('3', o, p, q, r);
    Load(..., s);
    Kernel('4', r, s, t);
    Load(..., u);
    Kernel('5', m, t, u, v);
    Load(..., w);
    Kernel('6', p, v, w, x);
    Kernel('7', x, y);

Figure 9: A program with two long live ranges m and p and its IG. The figure shows G together with its subgraphs G(V_s) and G(V_l).

As illustrated in Figure 9, V_l consists of two long live ranges m and p: m is live from kernel 2 to kernel 5 and p is live from kernel 3 to kernel 6. Since both streams interfere with each other, the forest G(V_l) has only one tree, which is a line connecting m and p.
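The partition of V into V_s and V_l, and of E into E_s, E_l and E_sl, can be sketched in Python for the program of Figure 9. This is only an illustrative sketch under simplifying assumptions: live ranges are approximated at kernel granularity (from the first to the last kernel that accesses a stream, ignoring the exact positions of Load instructions), two streams are taken to interfere when their kernel spans overlap, and all function names are hypothetical.

```python
from itertools import combinations

# Kernel calls of the Figure 9 program, in order; each entry lists the
# streams the kernel accesses (inputs and outputs are not distinguished).
KERNELS = [
    ("1", ["k", "l"]),
    ("2", ["l", "m", "n", "o"]),
    ("3", ["o", "p", "q", "r"]),
    ("4", ["r", "s", "t"]),
    ("5", ["m", "t", "u", "v"]),
    ("6", ["p", "v", "w", "x"]),
    ("7", ["x", "y"]),
]


def live_spans(kernels):
    """Return stream -> (first kernel, last kernel), 1-based, inclusive."""
    spans = {}
    for idx, (_, streams) in enumerate(kernels, start=1):
        for s in streams:
            first, _ = spans.get(s, (idx, idx))
            spans[s] = (first, idx)
    return spans


def partition_nodes(spans):
    """Split streams into short (at most two kernels) and long live ranges."""
    vs = {s for s, (a, b) in spans.items() if b - a + 1 <= 2}
    return vs, set(spans) - vs


def partition_edges(spans, vs, vl):
    """Build interference edges and split them into E_s, E_l and E_sl."""
    es, el, esl = set(), set(), set()
    for x, y in combinations(spans, 2):
        # Assumed interference test: the two kernel spans overlap.
        if spans[x][0] <= spans[y][1] and spans[y][0] <= spans[x][1]:
            edge = frozenset((x, y))
            if x in vs and y in vs:
                es.add(edge)
            elif x in vl and y in vl:
                el.add(edge)
            else:
                esl.add(edge)
    return es, el, esl


spans = live_spans(KERNELS)
vs, vl = partition_nodes(spans)
es, el, esl = partition_edges(spans, vs, vl)
```

Under these assumptions, `vl` contains exactly m and p, and `el` contains the single edge between them, in line with the description of Figure 9 as a forest with one tree.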

Algorithm 1 A general algorithm for coloring stream IGs.

     1: procedure IG_CGC
     2:   Input: G = (V, E) with V = {V_s, V_l} and E = {E_s, E_l, E_sl}
     3:   Output: An acyclic orientation, or equivalently, interval coloring α of G
     4:   if a transitive orientation α of G can be found then
     5:     return α
     6:   end if
     7:   χ_min(G) = +∞
     8:   Let O_s be the set of all transitive orientations of E_s, i.e., G(V_s)
     9:   Let O_l be the set of 2^trees(G(V_l)) transitive orientations of E_l, i.e., G(V_l)
    10:   for each orientation o_s × o_l ∈ O_s × O_l do
    11:     for (x, y) ∈ E_sl, where x ∈ V_s and y ∈ V_l do
    12:       Direct an arc from y to x (x to y) if y is a source (sink)
    13:     end for
    14:     Let α be the acyclic orientation of G thus found
    15:     if χ_α(G) < χ_min(G) then
    16:       χ_min(G) = χ_α(G)
    17:       Record the α as the current best
    18:     end if
    19:   end for
    20:   return α
    21: end procedure

In Section 4.3.1, we present a polynomial algorithm for coloring our stream IGs. In Section 4.3.2, we argue why this algorithm tends to give optimal and near-optimal colorings in practice.

4.3.1 Algorithm

As shown in Algorithm 1, if G is a comparability graph (Definition 3), then by Theorem 1, an optimal coloring, represented by a transitive orientation, is returned immediately (lines 4 – 6). Otherwise, G is not a comparability graph, in which case, an optimal or near-optimal coloring, represented by an acyclic orientation, is found in three steps. Let trees(G(V_l)) be the number of trees in G(V_l). The basic idea is to enumerate the set O_s of all transitive orientations of E_s, i.e., G(V_s) (in Step 1) and enumerate the set O_l of all 2^trees(G(V_l)) transitive orientations for the forest E_l, i.e., G(V_l) (in Step 2). As a result, for every possible combination o_s × o_l in O_s × O_l, a unique orientation of E_sl is determined (in Step 3). Among the |O_s| × |O_l| acyclic orientations of G found, the one whose heaviest (directed) path is the smallest is returned (lines 15 – 17). This is motivated by the fact stated in (2).

We restrict ourselves to transitive orientations in G(V_s) and G(V_l) only to both reduce the solution space to be searched and minimize the width of the final interval coloring found.

In Step 1 (line 8), the set O_s of all transitive orientations of E_s, i.e., G(V_s) is found. In real code, G(V_s) is generally connected, resulting in exactly two transitive orientations by Lemma 3 as illustrated in Figure 7. There can be only a limited number of transitive orientations when G(V_s) is disconnected by Lemma 2.

Figure 10: Two transitive orientations of a tree.

In Step 2 (line 9), we find all 2^trees(G(V_l)) transitive orientations of the trees in G(V_l), i.e., E_l. There are this many orientations because a tree is a bipartite graph and has exactly two transitive orientations (when it has more than one node). As shown in Figure 10, in one orientation, the edges are directed from the nodes at the odd-numbered levels and in the other, the edge directions are reversed.

Figure 11: Two orientations, (a) and (b), for the program in Figure 9.

In Step 3 (lines 10 – 13), for each orientation o_s × o_l ∈ O_s × O_l (line 10), a unique orientation of E_sl is determined (lines 11 – 13). For each edge (x, y) ∈ E_sl, where x ∈ V_s and y ∈ V_l, its orientation is assigned based on the property of y. From Figure 10, it can be easily observed that y is either a source or a sink under o_l. If y is a source, direct the edge from y to x, namely, maintain y’s property; otherwise, direct the edge from x to y.

Every orientation α of G found in line 14 is acyclic. This can be reasoned about as follows. No directed path confined to G(V_s) can be a cycle since G(V_s) is a comparability graph. In addition, no directed path that contains a node in G(V_l) can be a cycle since the node must be either a source or a sink (Figure 10).

IG_CGC is polynomial in practice. For comparability graphs, their recognition and optimal colorings are polynomial. In addition, G(V_s) is mostly connected, resulting in a few orientations (Lemmas 2 and 3). Finally, 2^trees(G(V_l)) is a small constant since G(V_l) has few trees.

Let us apply IG_CGC to the program given in Figure 9. In lines 4 – 6, G is detected to be a comparability graph. Its optimal coloring is found and returned immediately. Nevertheless, let us use this example to explain how the remaining steps of the algorithm work. G(V_s) is connected and happens to have only two transitive orientations. As G(V_l) has only one tree, there are two orientations. So there are a total of four orientations, two of which are shown in Figure 11, to consider in line 10. In this particular example, the one in Figure 11(a) is a possible solution found since it is transitive.

4.3.2 Analysis

In this section, we argue that IG_CGC finds optimal colorings for most stream programs. We show further that non-optimal colorings occur only infrequently and are near-optimal in the sense that they are only larger by the sum of one or two stream sizes in the worst case. Our claim is validated in our experiments in Section 5.

Recall that χ(G; w) denotes the chromatic number of G. In our algorithm, α is the best acyclic orientation found and χ_α(G; w) is its width. Let P_α be the heaviest directed path in G_α:

    P_α =def v_1, v_2, ..., v_m    (5)

According to (2), we have χ_α(G; w) = w(P_α). In addition, w(P_α) is the smallest among the heaviest directed paths in all orientations of G found in line 14 of IG_CGC given in Algorithm 1.

IG_CGC is optimal for G if χ_α(G; w) = χ(G; w).

All results presented below in this section are formulated and proved based on reasoning about the structure of P_α, which is uncovered in Lemma 4, resulting in five cases to be distinguished, and Lemma 5. In three of the five cases, our algorithm is optimal (Theorem 5). In the remaining two cases, our algorithm is also optimal for many stream IGs. Non-optimal solutions α are returned only infrequently when some strict conditions are met, and moreover, these solutions are near-optimal since χ_α(G; w) − χ(G; w) is small for reasonably large stream IGs (Theorems 6 and 7).

LEMMA 4. Only v_1 or v_m may appear in the forest G(V_l).

PROOF. Follows simply from the fact that for every orientation α of G found in line 14 of IG_CGC given in Algorithm 1, every node in G(V_l) is either a source or a sink under α.

This lemma implies that all nodes in P_α are contained in G(V_s) except its start and end nodes.

Let K̄_i = K_{(i⊖1)i} ∪ K_i ∪ K_{i(i⊕1)}. By Lemma 1, K̄_i is a maximal clique in G(V_s) (as illustrated in Figure 6).

LEMMA 5. v_2, v_3, ..., v_{m−1} form a clique in K̄_i for some i.

PROOF. α found by IG_CGC is a transitive orientation of G(V_s). First, there cannot be two different nodes v_i and v_j in v_2, v_3, ..., v_{m−1}, where i […]

[…]

Case P5. Either v_1 or v_m is in G(V_l) (but not both)

THEOREM 5. IG_CGC is optimal in Cases P1 – P3.

PROOF. In Case P1, G(V_s) is a comparability graph. Thus, P_α must be contained in a clique K in G(V_s) (and also in G). This means that χ_α(G; w) = w(P_α) = w(K). Since w(K) ≤ χ(G; w), we must have χ_α(G; w) ≤ χ(G; w). So α is optimal. The proof for Case P2 is similar since the forest G(V_l) is also a comparability graph. In Case P3, if v_1 and v_m interfere with each other, according to Lemma 5, it is easily deduced that v_1 and v_m must interfere with all nodes, i.e., v_1, v_2, ..., v_{m−1}, in K̄_i, and consequently, all the nodes in P_α. Thus P_α must appear together in a clique K in G. Thus, we can complete the rest of the proof for Case P3 in a similar way as for Case P1.

THEOREM 6. In Case P4, we have χ_α(G; w) − χ(G; w) ≤ w(v_1) + w(v_m), where the equality holds if and only if v_2, v_3, ..., v_{m−1} happen to form the heaviest clique K in G such that χ(G; w) = w(K).

PROOF. By Lemma 5, we have χ_α(G; w) − χ(G; w) ≤ w(v_1) + w(v_m). We now prove the “if” and “only if” parts for the equality. The “if” part is true since χ_α(G; w) = w(P_α) = w(v_1) + w(v_m) + w(K) = w(v_1) + w(v_m) + χ(G; w). The “only if” part is true due to Lemma 5 and the given hypothesis χ_α(G; w) − χ(G; w) = w(v_1) + w(v_m).

An analogue of Theorem 6 for Case P5 is given below.

THEOREM 7. In Case P5, suppose that v_1 is contained in G(V_l) but v_m is not. Then χ_α(G; w) − χ(G; w) ≤ w(v_1), where the equality holds if and only if v_2, v_3, ..., v_{m−1} happen to form the heaviest clique K in G such that χ(G; w) = w(K).

5. Experiments

Research into advanced compiler technology for stream processing is still in its infancy. There are presently no standard benchmarks available. Table 1 gives a list of 11 media and scientific applications available to us for the FT64 stream processor. NLAG-5 is a nonlinear algebra solver for two-dimensional nonlinear diffusion of hydrodynamics. QMR is the core iteration in the QMRCGSTAB algorithm for solving nonsymmetric linear systems. LUD is a dense LU Decomposition solver. As shown, the stream IGs in 10 benchmarks are comparability graphs. Their optimal colorings are guaranteed. Initially, the stream IG G for QMR is not a comparability graph and G(V_l) is not a forest. However, after unrolling its loop with a factor of two and performing some live range splitting, G(V_l) is a forest in the unrolled loop as depicted in Figure 12, to which […]

Table 1: Media and scientific programs (C (F) indicates the corresponding stream IG G (G(V_l)) is a comparability graph (forest)).

Figure 12: The IG for the unrolled QMR.

Figure 13: A transitive orientation of the IG for the unrolled QMR.

Below we demonstrate that IG_CGC can find optimal and nearly optimal colorings efficiently for a large number of randomly generated stream IGs that satisfy the characteristics of stream processing. We have implemented an algorithm that randomly generates the stream IGs that satisfy the stream characteristics exploited in the development of our IG_CGC algorithm as discussed in Section 4.3.

All random numbers are drawn from a discrete uniform distribution generated by unidrnd in Matlab unless specified otherwise.

There are five steps involved in generating a stream IG G. Step 1 generates the number of kernels, denoted num_kernel. In Step 2, we generate the set of short live ranges, namely G(V_s). For each kernel i, we generate two sets K_i and K_{i(i⊕1)}. We generate a random number in the range [1,3] to represent the number of live ranges in K_i. Similarly, we generate another random number over [1,3] to represent the number of live ranges in K_{i(i⊕1)}. As a result, each kernel has at most nine short live range streams live at the kernel: three from K_{(i⊖1)i}, three from K_i and three from K_{i(i⊕1)}.

In Step 3, we generate the set of long live ranges, namely G(V_l). We generate a random number p ranging from 1% to 20% to represent the percentage of long live ranges in G(V_l) over num_kernel. Thus, |G(V_l)| = p × num_kernel. For each long live range i, we generate a random number, length_i, over [3,6] to represent the number of kernels spanned by i (longer live ranges should be split) and another random number over [1, num_kernel − length_i + 1] to represent the kernel from which i starts to be live.

In Step 4, we keep G if G(V_l) is a forest and go back to Step 1 otherwise. In Step 5, we generate the stream sizes for all live ranges according to their characteristics in stream applications.

In our experiments, node weights are chosen to have two different distributions, Distribution U and Distribution L. We use Distribution U, a discrete uniform distribution, to demonstrate generally the worst-case performance advantages of IG_CGC over First-Fit. In this case, node weights are randomly taken from the range [1,6]. For each program, we modify Distribution U to obtain Distribution L by simply replacing each stream size w by 2^w. In this second case, the fact that some streams may be geometrically larger than others in a program is explicitly taken into account. Distribution L is actually a uniform distribution in the logarithmic scale log_2.

To test the scalability of IG_CGC, we have generated four groups of stream IGs. Groups G^1_U and G^1_L consist of IGs with between 3 and 50 kernels with their node weights generated using Distributions U and L, respectively. Groups G^2_U and G^2_L consist of larger IGs with between 50 and 100 kernels. Each group consists of exactly 30 different IGs (with their node weights being ignored). For each IG in each group, there are 10 instances of that IG instantiated with different node weights (in Step 5). So each group consists of 300 different weighted IGs, giving rise to a total of 1200 stream IGs considered in our experiments. Due to space limitations, we restrict our discussions to Group G^1_L and Group G^2_U. The node counts of the graphs in these two groups are shown in Figures 14 and 18, respectively, and their edge counts in Figures 15 and 19, respectively. The tree counts in the forests G(V_l) in these graphs are shown in Figures 16 and 20, respectively.

To check the optimality of IG_CGC, we have developed a formulation of the SRF allocation problem by integer linear programming (ILP). We ran the commercial ILP solver, CPLEX 10.1, to find an optimal coloring for each IG. If CPLEX does not terminate in five hours for an IG G, its optimal coloring is estimated optimistically by (1). Therefore, all optimality results about IG_CGC reported here are conservative.

Table 2 shows that IG_CGC obtains optimal solutions in 99% of the 1200 IGs in all four groups (1188 out of 1200). On the other hand, the solutions from First-Fit are mostly sub-optimal: First-Fit obtains optimal solutions in 26% of the 1200 IGs in all four groups (312 out of 1200).

Table 2: Optimality of IG_CGC and First-Fit for 1200 IGs.

    Group    #Weighted IGs    Optimal Solutions (%)
                              IG_CGC          First-Fit
    G^1_L    300              299 (99.67%)    55 (18.33%)
    G^1_U    300              298 (99.33%)    115 (38.33%)
    G^2_L    300              296 (98.67%)    38 (12.67%)
    G^2_U    300              295 (98.33%)    104 (34.67%)

The near-optimality of IG_CGC is achieved efficiently as validated in our experiments on a 3.2GHz Pentium 4 with 1GB memory. The longest time taken is 0.2 seconds for an IG with 409 nodes and 1836 edges, in which case G(V_l) consists of 8 trees. For most of the other IGs, the times elapsed are less than 0.05 seconds each.

Let us look at the differences between IG_CGC and First-Fit in terms of their allocation results. For a given weighted IG G, the quality of our solution α found by IG_CGC is measured as a gap with respect to that found by First-Fit, defined as follows:

$$\mathrm{gap}(G) = \frac{\chi_{\text{First-Fit}}(G; w) - \chi_{\alpha}(G; w)}{\chi_{\text{First-Fit}}(G; w)} \qquad (6)$$

where χ_First-Fit(G; w) is the solution (i.e., the smallest width required for coloring all nodes in G) found by First-Fit and χ_α(G; w) is the solution found by IG_CGC.

Table 3: Gaps between IG_CGC and First-Fit.

    Group    #Unweighted Graphs    ≥ 20%    ≥ 10%    < 0%
    G^1_L    30                    18       30       1
    G^1_U    30                    0        18       2
    G^2_L    30                    4        27       2
    G^2_U    30                    1        19       1

Table 3 shows the gaps between IG_CGC and First-Fit for the four groups, G^1_L, G^1_U, G^2_L and G^2_U, with 30 unweighted graphs in each group. As shown in Column 3, for 18 out of 30 graphs in Group G^1_L, there is at least one weight assignment achieving a gap of over 20%. The largest gap for this set of 30 graphs is 41% for a graph consisting of 114 nodes and 455 edges. In addition, as shown in Column 4, for at least 18 out of 30 graphs in each group, there is some weight assignment achieving a gap of over 10%. Finally, we observe from Column 5 that IG_CGC may perform slightly worse than First-Fit in six different weighted graphs. The gaps for five of these graphs are between −5.69% and −2.78% and the gap for the remaining one is −13.89%. In general, the gaps depend on the distribution of the node weights, i.e., the sizes of streams in a program. The major advantage of IG_CGC is that if an IG is a comparability graph, then the optimal allocation is guaranteed regardless of what its node weights are. In addition, even when an IG is not a comparability graph, IG_CGC can still achieve near-optimal colorings as proved in Theorems 6 and 7 and confirmed by the experimental data in Table 2. However, the performance of First-Fit is sensitive to the structure of an IG and the values of its node weights. For example, the gap as shown in Figure 22 is 50%.

Let us see why IG_CGC achieves better SRF allocation than First-Fit. First-Fit places the streams in an IG in a certain order, say, according to the order in which the streams become live and then their weights, without considering the structure of the IG, resulting in SRF fragmentation. We demonstrate this with a simple program shown in Figure 22(a). Figure 22(b) shows the SRF allocation under First-Fit. The streams S1 and S2, both live at kernel 1, are allocated before S3. S1, which is heavier than S2, is allocated first, followed by S2. However, since S2 is also live in kernel 2, S3, which is heavier than S1, can only be placed after S2. Based on the IG and the assigned transitive orientation in Figure 22(c), the optimal SRF allocation found by IG_CGC is shown in Figure 22(d).

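The five generation steps described in this section can be sketched as follows. This is only an illustration under stated assumptions, not the generator used in the experiments: it uses Python's random module instead of Matlab's unidrnd, models each live range as an inclusive interval of kernel indices, treats two live ranges as interfering when their intervals overlap, lets K_{i(i⊕1)} span kernels i and i+1 without wrap-around, rounds p × num_kernel to the nearest integer, and clamps length_i to num_kernel for very small graphs. All function and field names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive range of 1-based kernel indices


def overlaps(a: Interval, b: Interval) -> bool:
    """Assumed interference test: the two kernel ranges intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def is_forest(ranges: List[Interval]) -> bool:
    """Union-find cycle check on the interference graph of `ranges`."""
    parent = list(range(len(ranges)))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(ranges)):
        for j in range(i + 1, len(ranges)):
            if overlaps(ranges[i], ranges[j]):
                ri, rj = find(i), find(j)
                if ri == rj:
                    return False   # this edge would close a cycle
                parent[ri] = rj
    return True


def generate_ig(rng: random.Random, min_k: int = 3, max_k: int = 50,
                dist: str = "U") -> Dict[str, object]:
    while True:
        n = rng.randint(min_k, max_k)                  # Step 1: num_kernel
        short: List[Interval] = []
        for i in range(1, n + 1):                      # Step 2: G(V_s)
            short += [(i, i)] * rng.randint(1, 3)      # K_i
            if i < n:                                  # K_{i(i+1)}, no wrap-around
                short += [(i, i + 1)] * rng.randint(1, 3)
        p = rng.randint(1, 20) / 100                   # Step 3: G(V_l)
        long_: List[Interval] = []
        for _ in range(round(p * n)):
            length = min(rng.randint(3, 6), n)         # clamp (assumption)
            start = rng.randint(1, n - length + 1)
            long_.append((start, start + length - 1))
        if is_forest(long_):                           # Step 4: keep or retry
            break
    ranges = short + long_
    weights = [rng.randint(1, 6) for _ in ranges]      # Step 5: Distribution U
    if dist == "L":
        weights = [2 ** w for w in weights]            # Distribution L: w -> 2^w
    edges = [(a, b) for a in range(len(ranges))
             for b in range(a + 1, len(ranges))
             if overlaps(ranges[a], ranges[b])]
    return {"num_kernel": n, "ranges": ranges, "num_long": len(long_),
            "weights": weights, "edges": edges}


if __name__ == "__main__":
    g = generate_ig(random.Random(0))
    print(g["num_kernel"], len(g["ranges"]), len(g["edges"]))
```

Passing min_k=50 and max_k=100 would correspond to the larger groups, and dist="L" to Distribution L; instantiating the same graph with ten different weight vectors would require reusing the generated ranges and redrawing only the weights.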
Figure 14: Number of nodes in Group G^1_L. [Plot: #nodes for each of the 30 graphs.]

Figure 15: Number of edges in Group G^1_L. [Plot: #edges for each of the 30 graphs.]

Figure 16: Number of trees in Group G^1_L. [Plot: #trees for each of the 30 graphs.]

Figure 18: Number of nodes in Group G^2_U. [Plot: #nodes for each of the 30 graphs.]

Figure 19: Number of edges in Group G^2_U. [Plot: #edges for each of the 30 graphs.]

Figure 20: Number of trees in Group G^2_U. [Plot: #trees for each of the 30 graphs.]

Figure 17: Number of optimal solutions found by IG_CGC and First-Fit in 300 weighted IGs from Group G^1_L. [Plot: #optimums for each of the 30 graphs.]

Figure 21: Number of optimal solutions found by IG_CGC and First-Fit in 300 weighted IGs from Group G^2_U. [Plot: #optimums for each of the 30 graphs.]

Figure 22: An example demonstrating the superiority of IG_CGC over First-Fit.

    (a) Program (with loads/stores omitted):
            // weights: S1: 16; S2: 8; S3: 24
            kernel('1', S1, S2);
            kernel('2', S2, S3);
    (b) Allocation by First-Fit (SRF space over time): k1 holds S1, S2; k2 holds S2, S3.
    (c) IG over the streams S1, S2 and S3.
    (d) Allocation by IG_CGC (SRF space over time): k1 holds S2, S1; k2 holds S2, S3.

6. Related Work

Let us examine in more detail the two existing SRF management techniques for stream processors [5, 22]. Stream scheduling, introduced in [5], was earlier implemented in the StreamC compiler to compile stream programs for Imagine [5, 18]. Stream scheduling associates (if possible) all stream accesses to the same stream with the same buffer in the SRF. All such SRF buffers are placed in the SRF by applying some greedy First-Fit-like bin-packing heuristics. The key idea is to position each buffer at the smallest possible SRF address, always complete the current buffer before starting another, and finally, position the largest buffers first so that smaller buffers can fill in the cracks. On encountering spills, their algorithm resorts to double buffering to reduce the sizes of some buffers. Strip mining is applied before SRF allocation.

In [22], we apply graph coloring to place streams in the SRF. This entails a partitioning of the SRF into pseudo registers before graph coloring can be applied. For some applications, such SRF partitioning introduces aliases among pseudo registers, resulting in SRF fragmentation and wasted free space. The idea of partitioning a software-managed cache first this way and then applying graph coloring to perform cache allocation was first proposed in [14]. In [26], we focused on how to identify and represent loop-dependent reuse between streams in stream programs.

In this paper, we present a comparability graph coloring algorithm for SRF allocation developed based on a careful analysis of the characteristics of stream IGs. This new approach relies on neither First-Fit heuristics, which can be sub-optimal, nor SRF partitioning, which can cause SRF fragmentation. Our algorithm can achieve near-optimal colorings as validated in our experiments and outperforms First-Fit in a large number of stream IGs tested.

Graph coloring is a popular technique used in register allocation. Based on Chaitin's original formulation [2], a variety of graph coloring based register allocators have been developed [1, 3, 7]. Recently, Smith et al. [19] presented a generalized algorithm for irregular architectures with register aliases and non-disjoint register classes. Li et al. [14, 16] apply this generalized algorithm to assign arrays in embedded programs to scratchpad memory (SPM). Wang et al. [22] apply it further in SRF allocation as discussed above.

Fabri [6] discovered the connection between interval coloring and compile-time memory allocation. Since then some approximation algorithms have been proposed [8, 11]. Lefebvre and Feautrier [13] use interval coloring to minimize the number of data structures to rename in storage management for parallel programs. Continuing their graph coloring work [14], Li et al. [15] apply interval coloring to assign arrays in embedded programs to SPM.

Govindarajan and Rengarajan [10] studied a compile-time buffer allocation problem for so-called regular stream flow graphs.

In [24], some improvements of [5] for LRF (Local Register File) allocation in stream processors are presented. The LRFs are register files near the ALU clusters. Unlike the SRF, LRFs play a similar role to the register files in general-purpose processors.

7. Conclusion

This paper presents a new approach to optimizing utilization of the SRF for stream processors. The key insight is that the interference graphs (IGs) in media and scientific applications amenable to stream processing are comparability graphs or decomposable into well-structured comparability graphs (like trees). This has motivated the development of a new algorithm that is capable of finding optimal or near-optimal colorings efficiently, thereby outperforming First-Fit heuristics that are presently used in the literature.

This new approach also facilitates its integration with strip mining, an important optimization for stream processors for improving the performance of a stream application. There are at least two ways in which our algorithm can be combined with strip mining. One is to apply the algorithm to perform SRF allocation for a set of given strip sizes and return the best according to some performance-based cost model. Another is to combine our algorithm and strip mining to find a good strip size in a symbolic manner. All stream sizes are multiples of a strip size variable. So the best SRF allocation returned by our algorithm gives an upper bound on the strip size to be used. By combining this upper bound constraint with the execution time estimate of a stream application, also expressed as a function of the strip size variable, the best strip size can be solved analytically or numerically using, say, Matlab.

Finally, our IG-based algorithm allows other pre-pass optimizations such as live range splitting and stream prefetching to be well integrated into the same compiler framework.

8. Acknowledgments

This work is supported in part by the NSF Grant of China (60621003) and a Chinese Council Scholarship and in part by the Australian Research Council Grant (DP0881330) and the UNSW Engineering-International Research Collaboration Grant (PS16380).

References

[1] Preston Briggs, Keith D. Cooper, and Linda Torczon. Improvements to graph coloring register allocation. ACM Transactions on Programming Languages and Systems, 16(3):428–455, 1994.
[2] G. J. Chaitin. Register allocation & spilling via graph coloring. In SIGPLAN '82: Proceedings of the 1982 SIGPLAN symposium on Compiler construction, pages 98–101. ACM Press, 1982.
[3] Fred C. Chow and John L. Hennessy. The priority-based coloring approach to register allocation. ACM Trans. Program. Lang. Syst., 12(4):501–536, 1990.
[4] William J. Dally, Francois Labonte, Abhishek Das, Patrick Hanrahan, and Jung-Ho Ahn et al. Merrimac: Supercomputing with streams. In SC '03: Proceedings of the 2003 ACM/IEEE conference on Supercomputing, page 35. IEEE Computer Society, 2003.
[5] Abhishek Das, William J. Dally, and Peter Mattson. Compiling for stream processing. In PACT '06: Proceedings of the 15th international conference on Parallel architectures and compilation techniques, pages 33–42, New York, NY, USA, 2006. ACM.
[6] Janet Fabri. Automatic storage optimization. SIGPLAN Not., 14(8):83–91, 1979. ISSN 0362-1340.
[7] Lal George and Andrew W. Appel. Iterated register coalescing. ACM Trans. Program. Lang. Syst., 18(3):300–324, 1996.
[8] Jordan Gergov. Algorithms for compile-time memory optimization. In SODA '99: Proceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms, pages 907–908, Philadelphia, PA, USA, 1999. Society for Industrial and Applied Mathematics.
[9] Martin Charles Golumbic. Algorithmic Graph Theory and Perfect Graphs (Annals of Discrete Mathematics, Vol 57). North-Holland Publishing Co., Amsterdam, The Netherlands, 2004.
[10] R. Govindarajan and S. Rengarajan. Buffer allocation in regular dataflow networks: An approach based on coloring circular-arc graphs. In HIPC '96: Proceedings of the Third International Conference on High-Performance Computing (HiPC '96), page 419, 1996.
[11] H. A. Kierstead. A polynomial time approximation algorithm for dynamic storage allocation. Discrete Math., 87(2-3):231–237, 1991.
[12] Francois Labonte, Peter Mattson, William Thies, Ian Buck, Christos Kozyrakis, and Mark Horowitz. The stream virtual machine. In PACT '04: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pages 267–277, 2004.
[13] Vincent Lefebvre and Paul Feautrier. Automatic storage management for parallel programs. Parallel Comput., 24(3-4):649–671, 1998.
[14] Lian Li, Lin Gao, and Jingling Xue. Memory coloring: A compiler approach for scratchpad memory management. In PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, pages 329–338, 2005.
[15] Lian Li, Quan Hoang Nguyen, and Jingling Xue. Scratchpad allocation for data aggregates in superperfect graphs. In Proceedings of the 2007 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems, pages 207–216. ACM, 2007.
[16] Lian Li, Hui Feng, Quan Hoang Nguyen, Lin Gao, and Jingling Xue. Compiler-directed scratchpad memory management via graph coloring. ACM Transactions on Architecture and Code Optimization, 2009. To appear.
[17] John D. Owens. Computer Graphics on a Stream Architecture. PhD thesis, Stanford University, November 2002.
[18] John D. Owens, Ujval J. Kapasi, Peter Mattson, Brian Towles, Ben Serebrin, Scott Rixner, and William J. Dally. Media processing applications on the imagine stream processor. In Proceedings of the IEEE International Conference on Computer Design, pages 295–302, September 2002.
[19] Michael D. Smith, Norman Ramsey, and Glenn Holloway. A generalized algorithm for graph-coloring register allocation. In PLDI '04: Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation, pages 277–288. ACM, 2004.
[20] Michael Bedford Taylor and Jason Kim et al. The Raw microprocessor: A computational fabric for software circuits and general-purpose programs. IEEE Micro, 22(2):25–35, 2002.
[21] W. Thies, M. Karczmarek, M. Gordon, D. Maze, J. Wong, H. Ho, M. Brown, and S. Amarasinghe. StreamIt: A compiler for streaming applications, 2001. MIT-LCS Technical Memo TM-622.
[22] Li Wang, Xuejun Yang, Jingling Xue, Yu Deng, Xiaobo Yan, Tao Tang, and Quan Hoang Nguyen. Optimizing scientific application loops on stream processors. In LCTES '08: Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems, pages 161–170. ACM, 2008.
[23] Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and Katherine Yelick. The potential of the cell processor for scientific computing. In CF '06: Proceedings of the 3rd conference on Computing frontiers, pages 9–20, New York, NY, USA, 2006. ACM.
[24] Nan Wu, Mei Wen, Ju Ren, Yi He, and Chunyuan Zhang. Register allocation on stream processor with local register file. In ACSAC '06: Proceedings of the 11th Asia-Pacific Computer Systems Architecture Conference, pages 545–551, 2006.
[25] Xuejun Yang, Xiaobo Yan, Zuocheng Xing, Yu Deng, Jiang Jiang, and Ying Zhang. A 64-bit stream processor architecture for scientific applications. In ISCA '07: Proceedings of the 34th annual international symposium on Computer architecture, pages 210–219. ACM, 2007.
[26] Xuejun Yang, Ying Zhang, Jingling Xue, Ian Rogers, Gen Li, and Guibin Wang. Exploiting loop-dependent stream reuse for stream processors. In PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques, pages 22–31, 2008.
