Exploiting Vector Parallelism in Software Pipelined Loops

Samuel Larsen, Rodric Rabbah, and Saman Amarasinghe
MIT Computer Science and Artificial Intelligence Laboratory
{slarsen,rabbah,saman}@mit.edu

Abstract

An emerging trend in processor design is the addition of short vector instructions to general-purpose and embedded ISAs. Frequently, these extensions are employed using traditional vectorization technology first developed for supercomputers. In contrast, scalar hardware is typically targeted using ILP techniques such as software pipelining. This paper presents a novel approach for exploiting vector parallelism in software pipelined loops. The proposed methodology (i) lowers the burden on the scalar resources by offloading computation to the vector functional units, (ii) explicitly manages communication of operands between scalar and vector instructions, (iii) naturally handles misaligned vector memory operations, and (iv) partially (or fully) inhibits the optimization when vectorization will decrease performance.

Our approach results in better resource utilization and allows for software pipelining with shorter initiation intervals. The proposed optimization is applied in the compiler backend, where vectorization decisions are more amenable to cost analysis. This is unique in that traditional vectorization optimizations are usually carried out at the statement level. Although our technique most naturally complements statically scheduled machines, we believe it is applicable to any architecture that tightly integrates support for instruction and data level parallelism. We evaluate our methodology using nine SPEC FP benchmarks. In comparison to software pipelining, our approach achieves a maximum speedup of 1.38×, with an average of 1.11×.

1. Introduction

Increasingly, modern general-purpose and embedded processors provide short vector instructions that operate on elements of packed data [11, 14, 23, 24, 29, 37]. Vector instructions are desirable because the vector functional units operate on multiple operands in parallel. Thus, vector instructions increase the amount of concurrent execution while maintaining a compact instruction encoding. In addition, the performance advantage is realized with moderate architectural complexity and cost.

Short vector instructions are predominantly geared toward improving the performance of multimedia and DSP codes. However, today's vector extensions also afford a significant performance potential for a large class of data parallel applications, such as floating-point and scientific computations. In these applications, as in multimedia and DSP codes, a large extent of the processing is embedded within loops that vary from fully parallel to fully sequential.

Loop-intensive programs with an abundance of data parallelism can be software pipelined, essentially converting the available parallelism to ILP. Software pipelining overlaps instructions from different loop iterations and derives a schedule that attempts to maximize resource utilization. Without a compiler pass that vectorizes operations, a machine's vector resources are unused and software pipelining cannot fully exploit the potential of a multimedia architecture.

Compilers that target short vector extensions typically employ technology previously pioneered for vector supercomputers. However, traditional vectorization is not ideal for today's microprocessors since it tends to diminish ILP. When loops contain a mix of vectorizable and non-vectorizable operations, the conventional approach generates separate loops for the vector and scalar operations. In the vectorized loops, scalar resources are not well used, and in the scalar loops, vector resources remain idle. In modern processors, a reduction in ILP may significantly degrade performance. This is especially problematic for VLIW processors (e.g., Itanium) because they do not dynamically reorder instructions to rediscover parallelism.

In this paper, we show that targeting both scalar and vector resources presents novel problems, leading to a new algorithm for automatic vectorization. We formulate these problems in the context of software pipelining, with an emphasis on VLIW processors. Better utilization of both scalar and vector resources leads to greater overlap among iterations, thus improving performance. Our approach remains cognizant of the loop's overall resource requirements and selectively vectorizes only the most profitable data parallel computations.

As a result, the algorithm effectively balances computation across all machine resources.

The goal of selective vectorization is to divide operations between scalar and vector resources in a way that maximizes performance when the loop is software pipelined. Conventional strategies vectorize all data parallel operations. In loops with a large number of vector operations, this can leave scalar resources idle. Moving some operations to the scalar units can provide a more compact schedule. In other situations, full vectorization may be appropriate. This could occur when the overhead of transferring data between vector and scalar resources negates the benefit of vectorization. Alternatively, it may be advantageous to omit vectorization altogether in loops with little data parallelism. The most efficient choice depends on the underlying architectural resources, the number and type of operations in the loop, and the dependences between them.

Selective vectorization is further complicated when the compiler is responsible for satisfying complex scheduling requirements. Most architectures provide a set of heterogeneous functional units. It is not unusual for these units to support overlapping subsets of the ISA. Furthermore, the compiler may be responsible for multiplexing shared resources such as register file ports and operand networks. Also, scalar and vector operations may compete for the same resources. This is usually the case for memory operations since the same resources execute vector and scalar versions.

In this paper, we describe a union of ILP via software pipelining and DLP (data level parallelism) via short vector instructions (Section 3). We demonstrate that even fully vectorizable loops benefit from selective vectorization (Section 4). Our algorithm operates on a low-level IR. This approach is more suitable for emerging complex architectures since the impact of the optimization is measured with respect to actual machine resources. A backend approach also allows us to examine the interaction between vectorization and other backend optimizations, specifically software pipelining. This leads to a natural combination of two techniques that are generally considered alternatives. To our knowledge, no literature exists that proposes the partial vectorization we advocate in this paper.

We have implemented selective vectorization in Trimaran, a compilation and simulation infrastructure for VLIW architectures. Trimaran includes a large suite of optimizations geared toward improving ILP. It also includes a parametric cycle-accurate simulation environment. Our approach offers significant performance gains on the various architectural configurations we simulated. Compared to Trimaran's optimizations, which include software pipelining using modulo scheduling [31], our optimization yields a 1.11× speedup on a set of SPEC FP benchmarks.

2. Motivating Example

We use the dot product in Figure 1(a) to illustrate the potential of selective vectorization. The data dependence graph is shown in part (b). For clarity, we omit address calculations. Consider a target architecture with three issue slots as the only compiler-visible resources, and single-cycle latencies for all operations. A modulo schedule for the loop is shown in part (c). In a modulo schedule, the initiation interval (II) measures the constant throughput of the software pipeline. In the schedule of part (c), two cycles are needed to execute four instructions, resulting in an II of 2.0.

Often, reductions similar to that shown in Figure 1 are vectorizable using multiple partial summations that are combined when the loop completes. Since this scheme reorders the additions, it is not valid in all cases (e.g., with floating point data). For this example, assume parallelization of the reduction is illegal, preventing vectorization of the add.

Now consider an extension to the example architecture that allows for the execution of one vector instruction each cycle, including vector memory operations. Assume vector instructions operate on vectors of length two. In the face of loop carried dependences, a traditional vectorizing compiler distributes a loop into vector and scalar portions, as shown in part (d). Scalar expansion is used to communicate intermediate values through memory.

Since the processor can issue only one vector operation each cycle, modulo scheduling cannot discover additional parallelism in the vector loop. Four cycles are needed to execute four vector operations (two loads, one multiply, and one store). This amounts to an initiation interval of 2.0, since one iteration of the vector loop actually completes two iterations of the original loop. The operations in the scalar loop can be overlapped so that an iteration completes each cycle. Overall, this results in an initiation interval of 2 + 1 = 3, which is inferior to the performance gained from modulo scheduling alone. Even if the overhead of scalar expansion is overlooked, vectorization cannot recover from the degradation of ILP due to loop distribution.

A more effective approach is to leave the loop intact so vector and scalar operations can execute concurrently. This strategy is illustrated in Figure 1(e). Here, an II of 1.5 is achieved since the kernel completes two iterations every three cycles. Note that two scalar additions are needed to match the work output of the vector operations. In this example, we assume explicit operations are not required to communicate between scalar and vector instructions (namely, between the vector multiply and the scalar add). In reality, the cost of communicating operands must be considered. For many machines, communication incurs a large overhead as data must be transmitted through memory using a series of loads and stores.
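
For concreteness, the following C sketch shows the two forms discussed above: the original dot product of Figure 1(a), and the distributed version a traditional vectorizer produces (Figure 1(d)). The function names and the use of a C99 variable-length array for the temporary are our own illustration, not the paper's actual code.

/* Original loop, as in Figure 1(a). */
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Distributed form in the style of Figure 1(d): the loads and multiply
 * move to a vector loop, while the non-vectorizable reduction stays in
 * a scalar loop. The temporary t[] is the scalar expansion used to pass
 * intermediate values through memory. Assumes n is a multiple of the
 * vector length; a real compiler would also emit a cleanup loop. */
double dot_distributed(const double *a, const double *b, int n) {
    double t[n];                  /* C99 VLA standing in for a compiler temporary */
    double sum = 0.0;
    for (int i = 0; i < n; i++)   /* vector loop: executes with vector ops */
        t[i] = a[i] * b[i];
    for (int i = 0; i < n; i++)   /* scalar loop: sequential reduction */
        sum += t[i];
    return sum;
}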


Figure 1. (a) Dot product and (b) its data dependence graph. (c) Modulo schedule with an II of 2.0. Numbers in parentheses specify the original iterations to which each operation belongs. The kernel is highlighted. (d) Loop distribution into vector and scalar loops. (e) Modulo schedule for full vectorization without distribution gives an II of 1.5. (f) Modulo schedule with selective vectorization gives an II of 1.0.

Figure 1(f) illustrates that it is possible to outperform the other techniques if vectorization decisions are carried out judiciously. In the example, the selective vectorization of only one load operation leads to better resource utilization. In part (f), all issue slots are filled in the kernel, and in each cycle the maximum of one vector operation is issued. This results in an initiation interval of 1.0.

3. Selective Vectorization

The goal of selective vectorization is to divide vectorizable operations between scalar and vector resources in order to maximize loop performance. At this stage, we are concerned only with the decision of whether or not to vectorize each operation. Software pipelining, in the form of modulo scheduling [31], and register allocation are performed in subsequent phases of the compilation. In Section 4, we describe our compilation flow in more detail.

The fact that modulo scheduling follows our optimization has strong implications for our algorithm design. Specifically, the algorithm is solely concerned with balancing resource utilization, and ignores the latency of all operations. The underlying assumption is that vector operations rarely lie on dependence cycles. An exception is when the dependence distance is greater than or equal to the vector length. For example, a loop statement a[i + 4] = a[i] has a dependence cycle but can be vectorized for vector lengths of four or less. In practice, these situations are uncommon, and dependence cycles typically prevent vectorization altogether. When an operation does not lie on a dependence cycle, its latency is of minor concern since dependent operations can be separated by pipeline stages. Although long-latency operations tend to increase the length of the pipeline's prologue and epilogue, this has little impact on performance as long as the iteration count is relatively high.

3.1. Algorithm Overview

We base our selective vectorization algorithm on Kernighan and Lin's two-cluster partitioning heuristic [16]. The algorithm is outlined in Figure 2. It divides instructions between a vector and a scalar partition. Operations are moved one at a time between the partitions, searching for a division that minimizes a cost function (described below). The algorithm is iterative. In Figure 2, lines 7-19 represent a complete iteration. All operations are originally placed in the scalar partition. Every iteration repositions each vectorizable operation exactly once (lines 10-15). With each move, the algorithm computes the cost of the resulting configuration, noting the minimum cost encountered (lines 16-18). Once each operation has been repartitioned, the configuration with the lowest cost is used as the starting point for the next iteration (line 19). The process terminates when an iteration fails to improve on its starting configuration. Ultimately, operations remaining in the vector partition are vectorized.
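
The dependence-distance condition described above can be made concrete with a small sketch. In the following C fragment, the structure and function names are hypothetical (they are not part of Trimaran); the check simply refuses vectorization when any dependence cycle through an operation has a distance smaller than the vector length.

/* A dependence cycle through an operation, summarized by its minimum
 * loop-carried distance (in iterations). Illustrative structure. */
typedef struct {
    int min_distance;   /* smallest dependence distance on the cycle */
} DepCycle;

/* An operation is vectorizable for vector length vl if every dependence
 * cycle through it spans at least vl iterations. For example,
 * a[i + 4] = a[i] has distance 4 and is safe for vl <= 4. */
int vectorizable_at(const DepCycle *cycles, int ncycles, int vl) {
    for (int i = 0; i < ncycles; i++)
        if (cycles[i].min_distance < vl)
            return 0;   /* cycle too tight: must stay scalar */
    return 1;           /* no blocking cycle at this vector length */
}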

PARTITION-OPS ()
01  foreach op ∈ OPS
02      currPartition[op] ← SCALAR
03  bins ← BIN-PACK (currPartition)
04  bestPartition ← currPartition
05  bestCost ← HIGH-WATER-MARK (bins)
06  lastCost ← ∞
07  while lastCost ≠ bestCost
08      lastCost ← bestCost
09      locked ← ∅
10      foreach vectorizable op
11          bestOp ← FIND-OP-TO-SWITCH (bins, currPartition, locked)
12          currPartition ← SWITCH-OP (bestOp, currPartition)
13          locked ← locked ∪ bestOp
14          bins ← BIN-PACK (currPartition)
15          cost ← HIGH-WATER-MARK (bins)
16          if cost < bestCost
17              bestCost ← cost
18              bestPartition ← currPartition
19      currPartition ← bestPartition
20  return bestPartition

FIND-OP-TO-SWITCH (bins, currPartition, locked)
21  bestCost ← ∞
22  foreach op ∈ OPS
23      if op is vectorizable ∧ op ∉ locked
24          cost ← TEST-REPARTITION (op, bins, currPartition)
25          if cost < bestCost
26              bestCost ← cost
27              bestOp ← op
28  return bestOp

TEST-REPARTITION (op, bins, currPartition)
29  currPartition ← SWITCH-OP (op, currPartition)
    // release resources, including communication overhead, as necessary
30  bins ← RELEASE-RESOURCES (op, bins, currPartition)
31  bins ← RESERVE-RESOURCES (op, bins, currPartition)
32  return HIGH-WATER-MARK (bins)

BIN-PACK (currPartition)
33  foreach r ∈ RESOURCES
34      bins[r] ← 0
35  foreach op ∈ OPS
36      bins ← RESERVE-RESOURCES (op, bins, currPartition)
37  return bins

RESERVE-RESOURCES (op, bins, currPartition)
38  if currPartition[op] = SCALAR
39      opcode ← SCALAR-OPCODE (op)
40      num ← VECTOR-LENGTH
41  else
42      opcode ← VECTOR-OPCODE (op)
43      num ← 1
44  for i ← 1 to num
45      bins ← RESERVE-LEAST-USED (opcode, bins)
    // reserve resources for transferring operands, as needed
46  if COMMUNICATION-NEEDED (op, currPartition)
47      foreach c ∈ COMMUNICATION-OPS (op, currPartition)
48          bins ← RESERVE-LEAST-USED (c, bins)
49  return bins

RESERVE-LEAST-USED (opcode, bins)
50  foreach r ∈ RESOURCES-REQUIRED (opcode)
51      minHigh ← ∞
52      minCost ← ∞
53      foreach a ∈ ALTERNATIVES (r)
54          temp ← bins
            // reserve resource for appropriate number of cycles
55          temp[a] ← temp[a] + CYCLES (opcode, a)
56          high ← 0
57          cost ← 0
58          foreach r ∈ RESOURCES
59              high ← max(high, temp[r])
60              cost ← cost + temp[r]²
61          if (high < minHigh) ∨ (high = minHigh ∧ cost < minCost)
62              minHigh ← high
63              minCost ← cost
64              best ← a
65      bins[best] ← bins[best] + CYCLES (opcode, best)
66  return bins

HIGH-WATER-MARK (bins)
67  high ← 0
68  foreach r ∈ RESOURCES
69      high ← max(high, bins[r])
70  return high

SWITCH-OP (op, currPartition)
71  if currPartition[op] = SCALAR
72      currPartition[op] ← VECTOR
73  else
74      currPartition[op] ← SCALAR
75  return currPartition

Figure 2. Partitioner pseudo code.
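
For readers who prefer compilable code, the following C sketch mirrors the driver portion of Figure 2 (PARTITION-OPS and FIND-OP-TO-SWITCH, lines 1-28). The helper routines are assumed to behave as in the pseudocode, and the data layout (partitions stored in arrays indexed by operation id) is our own simplification.

#include <limits.h>
#include <stdbool.h>

#define MAX_OPS 256
enum Part { SCALAR, VECTOR };

/* Assumed helpers, behaving as in Figure 2: bin_pack_cost runs BIN-PACK
 * and returns the HIGH-WATER-MARK of the result; test_repartition runs
 * TEST-REPARTITION against a checkpointed copy of the bins. */
extern int  bin_pack_cost(const enum Part *part, int nops);
extern int  test_repartition(int op, const enum Part *part, int nops);
extern bool is_vectorizable(int op);

static void switch_op(enum Part *part, int op) {           /* SWITCH-OP */
    part[op] = (part[op] == SCALAR) ? VECTOR : SCALAR;
}

/* PARTITION-OPS (Figure 2, lines 1-20). */
void partition_ops(enum Part best[], int nops) {
    enum Part cur[MAX_OPS];
    bool locked[MAX_OPS];
    for (int op = 0; op < nops; op++)
        cur[op] = best[op] = SCALAR;                       /* lines 1-2 */
    int bestCost = bin_pack_cost(cur, nops);               /* lines 3-5 */
    int lastCost = -1;                                     /* forces first sweep */
    while (lastCost != bestCost) {                         /* line 7 */
        lastCost = bestCost;
        for (int op = 0; op < nops; op++)
            locked[op] = false;                            /* line 9 */
        for (int n = 0; n < nops; n++) {                   /* lines 10-15 */
            if (!is_vectorizable(n))
                continue;
            /* FIND-OP-TO-SWITCH (lines 21-28): cheapest unlocked move. */
            int bestOp = -1, moveCost = INT_MAX;
            for (int op = 0; op < nops; op++) {
                if (!is_vectorizable(op) || locked[op])
                    continue;
                int c = test_repartition(op, cur, nops);
                if (c < moveCost) { moveCost = c; bestOp = op; }
            }
            switch_op(cur, bestOp);                        /* line 12 */
            locked[bestOp] = true;                         /* line 13 */
            int cost = bin_pack_cost(cur, nops);           /* lines 14-15 */
            if (cost < bestCost) {                         /* lines 16-18 */
                bestCost = cost;
                for (int op = 0; op < nops; op++)
                    best[op] = cur[op];
            }
        }
        for (int op = 0; op < nops; op++)
            cur[op] = best[op];                            /* line 19 */
    }
}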

3.2. Cost Calculation

Operations are selected for repartitioning based on a cost function. We define the cost of a configuration to be the weight of the most heavily used resource (lines 67-70), where the weight is the number of machine cycles the resource is reserved. In modulo scheduling terminology, this corresponds to the resource-constrained minimum initiation interval (ResMII). In the absence of dependence cycles, the ResMII represents a lower bound on the II of the modulo schedule.

We compute the cost of a configuration using a bin-packing approach (lines 33-37) akin to the original formulation in modulo scheduling [31]. Namely, a bin (with zero initial weight) is associated with each compiler-visible resource. Operations are selected one at a time and assigned to the bin that minimizes the weight of the most heavily-used resource. Each placement of an operation in a bin increases the weight of that bin. If an operation reserves a resource for more than one cycle, the bin's weight is adjusted accordingly (line 55). In addition, each scalar operation is binned k times to match the work output of a single k-wide vector operation (lines 38-40).

Since communication of operands between scalar and vector operations usually requires explicit instructions, the partitioning algorithm accounts for these operations as a result of its decisions (lines 46-48). Specifically, when an operation is placed in a partition different from operations on which it is dataflow-dependent, the appropriate communication instructions are also binned. Note that a particular operand is transferred at most once since all consumers can reuse the transmitted data.
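
A minimal sketch of this cost function appears below, assuming a single resource per opcode and ignoring multi-cycle reservations and resource alternatives; scalar operations are binned VL times, and the returned high-water mark corresponds to the ResMII. The resource-lookup hooks are hypothetical stand-ins for the machine description.

#define NRES 8   /* number of compiler-visible resources (illustrative) */
#define VL   2   /* vector length k */

enum Part { SCALAR, VECTOR };

/* The single resource demanded by an operation in each partition,
 * assumed to come from the machine description (hypothetical hooks). */
extern int scalar_resource(int op);
extern int vector_resource(int op);

/* Cost of a configuration: the weight of the most heavily used resource,
 * i.e. the ResMII in modulo scheduling terms. */
int config_cost(const enum Part *part, int nops) {
    int bins[NRES] = { 0 };
    for (int op = 0; op < nops; op++) {
        if (part[op] == SCALAR)
            bins[scalar_resource(op)] += VL;   /* k scalar copies match one
                                                  k-wide vector op */
        else
            bins[vector_resource(op)] += 1;
        /* A complete implementation also bins explicit transfer
           instructions when an operand crosses partitions. */
    }
    int high = 0;                              /* HIGH-WATER-MARK */
    for (int r = 0; r < NRES; r++)
        if (bins[r] > high)
            high = bins[r];
    return high;
}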

Figure 3. Compiler flow.

In the original bin-packing formulation [31], operations are selected for placement in order of their scheduling alternatives such that those with little freedom are placed first. This heuristic produces better results than unordered insertion. We make one further optimization: when two scheduling alternatives do not increase the weight of the most heavily-used resource, we select the option that minimizes the sum of the squares of the bin weights (lines 53-65). This strategy tends to balance operations across bins even when these operations do not immediately affect the total cost. It also allows the partitioner to quickly compare alternatives during operation selection.

When choosing the best operation for repositioning, the algorithm does not perform a complete bin-packing for every possible move. Such an approach is prohibitively expensive since it requires n bin-packing passes before an operation is selected. Instead, the algorithm checkpoints the current state of the bins, releases the resources for the operation under consideration, and reserves the set of resources needed in the other partition (lines 29-32). For example, if an operation is moved to the vector partition, its scalar resources are released and its vector resources reserved. If necessary, we also account for any transfer operations added or eliminated in the new configuration. The cost is then recorded, the bins are restored to their original state, and the process repeats until all operations are considered. Only after an operation is finally selected for repositioning are the bins rebalanced with a fresh bin-packing (line 14). In order for this scheme to work well, it is important that all bins are balanced as much as possible during bin-packing. Otherwise, the removal of an operation tends to lead to an inaccurate cost estimate. The optimization described above provides this functionality.

Note that when an operation is moved from one partition to another, the cost of the new configuration may increase. This occurs, for example, when the producers and consumers of an operation remain in the other partition, necessitating explicit transfer instructions. Thus, while each iteration of the algorithm improves the overall cost, some steps (lines 11-12) within an iteration may increase the cost.

In the worst case, the algorithm requires n total iterations for a loop containing n vectorizable operations. Every iteration repartitions each operation once and bin-packs each new partition. Since our greedy bin-packing requires n steps, this results in an O(n³) algorithm. In practice we observe that a solution is found after only a few iterations, making the algorithm very practical. In fact, for our benchmarks, the time spent during selective vectorization is far less than the time spent modulo scheduling. If a faster execution of the algorithm is desirable, we can artificially limit the number of iterations carried out.

3.3. Loop Transformation

If partitioning selects operations for vectorization, we next construct the transformed loop. Operations in the vector partition are replaced with their corresponding vector opcodes. Scalar operations are emitted k times, where k is the vector length. When explicit communication is required, the appropriate transfer operations are also generated. To satisfy dependences, it is important to emit operations in a specific order. Strongly connected components are sorted topologically, with operations in each component emitted in original program order. This is analogous to the loop distribution performed in traditional vectorization [6, 39]. Finally, the loop increment and upper bound are adjusted according to the vector length, and a cleanup loop is constructed for cases where the upper bound is unknown or is not a multiple of the vector length.
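
As an illustration of this transformation, the following C sketch shows a loop after selective vectorization with a vector length of two, under an illustrative division in which the partitioner vectorized the loads and the multiply while leaving the add and store scalar. The v2d type and helper functions merely emulate two-element vector instructions; they are not a real intrinsic API.

typedef struct { double e[2]; } v2d;   /* emulated two-element vector */

static v2d vload(const double *p) { v2d v = { { p[0], p[1] } }; return v; }
static v2d vmul(v2d x, v2d y) {
    v2d r = { { x.e[0] * y.e[0], x.e[1] * y.e[1] } };
    return r;
}

void transformed_loop(double *c, const double *a, const double *b,
                      double s, int n) {
    int i;
    for (i = 0; i + 2 <= n; i += 2) {             /* increment scaled by VL = 2 */
        v2d t = vmul(vload(&a[i]), vload(&b[i])); /* vector partition */
        c[i]     = t.e[0] + s;                    /* scalar add and store, */
        c[i + 1] = t.e[1] + s;                    /* emitted k = 2 times   */
    }
    for (; i < n; i++)                            /* cleanup loop for n % 2 != 0 */
        c[i] = a[i] * b[i] + s;
}
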
4. Evaluation

Figure 3 illustrates our compiler flow. We use SUIF [38] as the compiler frontend. SUIF provides accurate memory dependence information that is crucial for identifying data parallelism. Additionally, it provides a suite of existing dataflow optimizations. Our compiler backend and simulation environment are provided by Trimaran [3]. Trimaran provides a modulo scheduler and all necessary backend functionality, such as register allocation. In order to connect these two infrastructures, we implemented a SUIF-to-Trimaran IR translator. In addition, we added support for vector operations throughout the compiler.

For all benchmarks, we applied a suite of standard optimizations before vectorization. These include register promotion, common subexpression elimination, copy propagation, constant propagation, dead code elimination, induction variable optimization, and loop-invariant code motion. Aside from scalar privatization [6], we did not use any optimizations specifically designed to expose data parallelism. Advanced loop transformations would likely improve the performance of our system. However, the focus of this paper is not the identification of data parallelism, but how it can be exploited to create highly efficient schedules.

We identify vectorizable operations using the approach first developed for vector supercomputers [6, 39]. This method requires loop dependence analysis to identify dependence cycles. Operations in a dependence cycle must execute sequentially; the rest can be vectorized. The major difficulty is to accurately identify dependences between memory references. A simple approach that assumes a dependence between any load/store pair usually precludes any vectorization. For array-based code, an extensive literature exists for computing dependences (see [6] for a review). After building the dependence graph, cycles are identified using Tarjan's algorithm for strongly connected components [36].

Dependence analysis is simplified by the fact that our benchmarks are coded in Fortran. Our techniques are equally applicable to languages such as C. However, further methods are typically required to cope with pointer aliasing. The simplest solution is to extend the language with directives that allow the user to convey dependence information (e.g., the restrict keyword). Another approach is to use a whole-program pointer analysis. Perhaps the most practical solution is to insert runtime dependence checks. This is the strategy taken by the Intel compiler [7].

Selective vectorization and modulo scheduling are applied to do loops without control flow or function calls. In general, these are the loops to which both software pipelining and vectorization are most applicable. Innovations such as if-conversion [6] and hyperblock formation [22] have made it possible to target a broader class of loops, but they were not used in this study.

Details of the simulated architecture are shown in Table 1. Since Trimaran is an infrastructure for VLIW architectures, the evaluation below focuses on statically scheduled processors. However, note that our technique is applicable to any architecture that implements a combination of instruction level and short vector parallelism.

Processor Parameter                  Value
Issue width                          6
Integer units                        4
Floating-point units                 2
Load/store units                     2
Branch units                         1
Vector units (shared int/fp)         1
Vector merge units                   1
Vector length (64-bit elements)      2
Scalar integer registers             128
Scalar floating point registers      128
Vector integer registers             64
Vector floating point registers      64
Predicate registers                  64

Op Type        Latency     Op Type        Latency
Int ALU        1           Fp ALU         4
Int Multiply   3           Fp Multiply    4
Int Divide     36          Fp Divide      32
Load           3           Branch         1

Table 1. Processor configuration.

Our simulated processor provides one vector unit for both integer and floating point computation. Vector operations are assigned the same latency as their scalar counterparts. A separate vector unit is available for merging data from two vector registers. This is used to support misaligned vector memory operations. In our evaluation, we assume the target architecture does not allow unaligned memory operations. As a result, it is the responsibility of the compiler to merge data from adjacent regions.

Support for merging is typically provided through specialized instructions. (On AltiVec, for example, merging is performed with the vperm instruction.) An unaligned vector load is implemented with two aligned loads followed by a merge to extract the desired elements from two registers. An unaligned store is even more expensive as data must first be loaded from memory, merged with data in registers, and then written back to memory. Fortunately, much of the overhead of misaligned memory operations can be eliminated in vectorized loops by reusing data from the previous iteration [13, 40]. This eliminates the extra memory operations otherwise required by misaligned references.

The Trimaran modulo scheduler relies on rotating registers and predicated execution. Therefore, we have extended the infrastructure with rotating vector registers and predicated vector operations. If rotating registers are not available, a similar effect is achievable with modulo variable expansion [19, 32].

Our benchmarks consist of a number of scientific applications gathered from the SPEC 92, 95, and 2000 floating point suites. These benchmarks represent the subset for which our compiler detects some degree of data parallelism. In keeping with contemporary processors, we assume vector operands comprise a total of 128 bits. Since our benchmarks operate on 64-bit data, this leads to a vector length of two. The Pentium 4 is an example of an existing architecture that supports vectors of two double-precision elements.
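
To make the merging scheme described above concrete, the following C sketch emulates it: an aligned load pair plus a merge realizes a misaligned vector load, and keeping the previous iteration's load in a register removes the extra memory operation [13, 40]. The v2d type and helper functions are illustrative stand-ins, not a real intrinsic API.

typedef struct { double e[2]; } v2d;   /* emulated 128-bit vector register */

static v2d vload_aligned(const double *p) {   /* p assumed 16-byte aligned */
    v2d v = { { p[0], p[1] } };
    return v;
}

/* Merge: select elements 1..2 of the concatenation x:y, the role that
 * AltiVec's vperm plays for an offset of one element. */
static v2d vmerge1(v2d x, v2d y) {
    v2d r = { { x.e[1], y.e[0] } };
    return r;
}

/* Sums the misaligned slice a[1] .. a[n]; assumes a is aligned, n is
 * even, and a[0 .. n+1] are accessible. Each iteration issues one
 * aligned load and reuses the previous iteration's load, so the merge
 * adds no extra memory traffic. */
double sum_shifted(const double *a, int n) {
    double s = 0.0;
    v2d prev = vload_aligned(&a[0]);
    for (int i = 0; i < n; i += 2) {
        v2d next = vload_aligned(&a[i + 2]);   /* one aligned load per iteration */
        v2d v = vmerge1(prev, next);           /* yields a[i+1], a[i+2] */
        s += v.e[0] + v.e[1];
        prev = next;                           /* reuse last iteration's load */
    }
    return s;
}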

In general, our technique is most applicable where short vectors are used. As vector length increases, the processing power of the vector units begins to overwhelm that of the scalar units, and full vectorization becomes increasingly advantageous. The processor in Table 1 has a vector length of two, and therefore the throughput of the scalar units equals or exceeds that of the vector units. Selective vectorization is most applicable in such situations.

We fully simulated our benchmarks using the training input workloads to keep simulation times practical. Many of the benchmarks have large memory footprints that can heavily burden the memory system. However, we have not implemented any optimizations that specifically target the memory system. As a result, our evaluation does not take memory system performance into account.

4.1. Speedup

We now compare the selective vectorization algorithm described in Section 3 to several different parallelization techniques. We have implemented a traditional vectorizer in SUIF based on the scheme described by Allen and Kennedy [6]. When loops contain a mix of vectorizable and non-vectorizable operations, they are distributed into separate loops. Scalar expansion is used to communicate data between loops. In a straightforward implementation, vectorization tends to create a large number of distributed loops. In order to mitigate this effect as much as possible, we perform loop fusion in the vectorizer [9].

To study the effects of loop distribution, we also implemented a second vectorizer in the backend. As in the traditional vectorizer, all data-parallel operations are vectorized. However, in this scheme the loop is left intact (i.e., not distributed) in order to expose ILP among vector and scalar operations. Scalar operations are unrolled by the vector length in order to match the work output of the vector operations. In the results that follow, we refer to this technique as full vectorization. In both vectorizers, all loops are modulo scheduled.

Speedups are shown relative to modulo scheduling. For this baseline, we unroll all loops k times (for vector length k) in order to lower the loop overhead and reduce computation due to address arithmetic. In the absence of vectorized operations, unrolling serves to match the benefit provided by vectorized memory operations, which use one address to access multiple locations. With unrolling, the same effect is realized using base + offset addressing.

The architecture in Table 1 does not provide specialized support for communicating operands between scalar and vector functional units. Communication is accomplished through memory using a series of load and store operations. Since the resulting cost is so high, we made one improvement to the traditional and full vectorizers: an operation is not vectorized unless it has at least one vectorizable predecessor or successor. Doing otherwise is clearly unfavorable. With selective vectorization, such scenarios are detected automatically.

Our target processor only supports aligned vector memory operations. Therefore, the compiler inserts the proper instructions to access misaligned data. We do not employ any techniques that provide alignment information, meaning that all vector memory operations are assumed to be misaligned. However, most of the overhead is eliminated using data from the previous iteration [13, 40].

Table 2 shows the speedup of each vectorization technique over modulo scheduling. Comparing traditional and full vectorization, we see that the performance degradation from loop distribution is considerable. Much of this is due to the fact that our simulated architecture does not support an efficient scatter/gather mechanism. When memory operations are not vectorizable, they are first aggregated into contiguous memory so they can be used directly in vectorized computation.

Benchmark       Traditional   Full   Selective
093.nasa7       0.18          0.76   1.04
101.tomcatv     0.71          0.99   1.38
103.su2cor      0.63          0.94   1.15
104.hydro2d     0.94          1.00   1.03
125.turb3d      0.38          0.93   0.95
146.wave5       0.76          0.96   1.03
171.swim        1.01          1.00   1.17
172.mgrid       0.53          0.99   1.26
301.apsi        0.51          0.97   1.02

Table 2. Speedup compared to modulo scheduling.

In most cases, full vectorization (with modulo scheduling) matches the performance of modulo scheduling alone. An exception is nasa7, where vectorization significantly underperforms modulo scheduling. In all cases but one, selective vectorization yields the best performance. For some benchmarks, the improvement is substantial. In the case of tomcatv, selective vectorization achieves a 1.38× speedup over modulo scheduling alone. For some benchmarks performance improvements are small, and in the case of turb3d, selective vectorization performs worse than baseline modulo scheduling. In fact, our algorithm does create more compact schedules for the critical loops in this benchmark. However, tighter schedules tend to increase the number of stages in a software pipeline, leading to longer prologues and epilogues. Since the critical loops in turb3d have low iteration counts, the prologue and epilogue contribute significantly to the execution, thereby negating the advantage of vectorization.

               Number     ----------- ResMII -----------      ------------- II -------------
Benchmark      of Loops   Better      Equal       Worse       Better      Equal       Worse
093.nasa7      30         9 (30.0%)   21 (70.0%)  0 (0.0%)    8 (26.7%)   21 (70.0%)  1 (3.3%)
101.tomcatv    6          5 (83.3%)   1 (16.7%)   0 (0.0%)    5 (83.3%)   1 (16.7%)   0 (0.0%)
103.su2cor     38         27 (71.1%)  11 (28.9%)  0 (0.0%)    27 (71.1%)  11 (28.9%)  0 (0.0%)
104.hydro2d    67         23 (34.3%)  44 (65.7%)  0 (0.0%)    23 (34.3%)  44 (65.7%)  0 (0.0%)
125.turb3d     12         4 (33.3%)   8 (66.7%)   0 (0.0%)    4 (33.3%)   7 (58.3%)   1 (8.3%)
146.wave5      133        57 (42.9%)  76 (57.1%)  0 (0.0%)    51 (38.3%)  73 (54.9%)  9 (6.8%)
171.swim       14         5 (35.7%)   9 (64.3%)   0 (0.0%)    5 (35.7%)   9 (64.3%)   0 (0.0%)
172.mgrid      16         9 (56.2%)   7 (43.8%)   0 (0.0%)    9 (56.2%)   7 (43.8%)   0 (0.0%)
301.apsi       61         18 (29.5%)  42 (68.9%)  1 (1.6%)    17 (27.9%)  39 (63.9%)  5 (8.2%)

Table 3. Number of loops for which selective vectorization finds an II better, equal to, or worse than competing techniques.

4.2. Opportunities for Selective Vectorization

Table 3 examines the degree to which selective vectorization is performed. For each benchmark, we show the number of loops for which selective vectorization finds a schedule better than, equal to, or worse than the competing methods (i.e., modulo scheduling, traditional vectorization, and full vectorization). Since no technique can improve the II when it is constrained by recurrences, we only report loops that are resource-limited. Table 3 separates results into the resource-constrained II (ResMII) as computed by the modulo scheduler and the final II. As shown, there are a significant number of loops for which selective vectorization provides an advantage. Furthermore, there is only one loop across all benchmarks for which selective vectorization produces a loop with higher resource requirements. In the last column, we see that selective vectorization leads to a small number of loops with an inferior II. This is due to the fact that iterative modulo scheduling is a heuristic. Even when we lower resource requirements, the algorithm is not guaranteed to achieve a lower II in all cases.

4.3. Communication

In Table 4 we demonstrate the importance of communication considerations during selective vectorization. The second column shows the speedup compared to modulo scheduling when explicit transfer operations are taken into account during cost analysis (replicated from Table 2). Recall that our simulated architecture requires a series of load and store instructions to communicate operands between vector and scalar resources. In the third column we show the performance when communication overhead is ignored during partitioning. Note that for correct operation, these instructions are still inserted prior to modulo scheduling. When communication is neglected, most benchmarks experience a severe performance degradation. It is clear that a viable solution must track communication costs carefully if selective vectorization is to be successful.

Benchmark      Considered   Ignored
093.nasa7      1.04         0.78
101.tomcatv    1.38         1.22
103.su2cor     1.15         1.02
104.hydro2d    1.03         0.98
125.turb3d     0.95         0.81
146.wave5      1.03         0.99
171.swim       1.17         1.08
172.mgrid      1.26         1.14
301.apsi       1.02         0.97

Table 4. Speedup of selective vectorization compared to modulo scheduling when communication overhead is considered vs. ignored.

4.4. Alignment

Finally, Table 5 shows what is possible when memory alignment information is available. In the second column, we show the speedup of selective vectorization compared to modulo scheduling. These numbers reproduce those shown in Table 2, where the vector memory operations are assumed to be misaligned. The third column shows the speedup when the overhead of merging misaligned regions is disregarded. Alignment overhead can be eliminated when operations are known at compile-time to be aligned. The results in the last column of Table 5 represent a best-case scenario since every reference is assumed to be aligned.

Static alignment information can be gathered using a number of techniques. Our goal is not to advocate a particular approach. Rather, we point out that when static alignment information is available, it is readily incorporated into our system. Simply, explicit realignment operations are considered during cost analysis. For many benchmarks, improvements are modest as software pipelining is able to hide

the extra latency associated with explicit realignment. However, there are a few benchmarks (most notably tomcatv) that benefit from the reduced resource contention.

Benchmark      Misaligned   Aligned
093.nasa7      1.04         1.07
101.tomcatv    1.38         1.48
103.su2cor     1.15         1.16
104.hydro2d    1.03         1.05
125.turb3d     0.95         0.95
146.wave5      1.03         1.04
171.swim       1.17         1.21
172.mgrid      1.26         1.26
301.apsi       1.02         1.02

Table 5. Speedup of selective vectorization compared to modulo scheduling when vector memory operations are assumed to be misaligned vs. aligned.

5. Related Work

With the advent of short vector extensions, there is a need to supply access to these instructions to the high-level programmer. Early support came in the form of inline assembly, macro calls, and specialized library routines. Some proposals for automatic parallelization advocate a basic block approach [20, 34, 33, 18]. Most approaches follow classic techniques [6, 39] developed for vector supercomputers [7, 10, 13, 25, 26, 35, 41]. Commercial products targeting multimedia extensions include the Intel compiler [7], the IBM XL compiler [2], the VAST/AltiVec compiler [4], and VectorC [1].

Whether parallelism is identified within a basic block or across loop iterations, existing techniques focus on extracting as much DLP as possible. They do not attempt the partial vectorization we advocate in this paper. We expect that commercial vectorizers perform sophisticated cost analyses to determine when vectorization is profitable and when it should be excluded. We argue that cost analysis is more accurate in the compiler backend, where vectorization decisions can be evaluated in terms of actual architectural resources.

One of the primary difficulties facing automatic vectorization for short vector extensions is the alignment restriction placed on vector memory operations. Our approach readily copes with the alignment restrictions imposed by many architectures. Some processors (e.g., AltiVec) require that vector memory operations address locations that are aligned on natural boundaries. As a result, the compiler explicitly reorganizes the data using additional instructions. Other processors (e.g., Pentium) support unaligned memory operations, but incur a performance penalty if the data span multiple cache lines. Many of the techniques for satisfying alignment constraints [7, 13, 17, 21, 28, 40, 43] can be directly applied in our algorithm.

The software pipeliner in Trimaran is an implementation of Rau's iterative modulo scheduling [30, 31]. Similar methods were developed by Lam [19], who pioneered modulo variable expansion. Important extensions to the core modulo scheduling algorithm include techniques to reduce register pressure [12, 15] and the ability to schedule loops with control flow [22]. To our knowledge, we are the first to advocate vectorization as a method to improve resource utilization in modulo scheduled loops.

Recently there has been interest in modulo scheduling for clustered architectures [5, 8, 27, 42]. Explicit communication among clusters is similar to communication between vector and scalar operations. In both cases, partitioning can introduce transfer operations not present in the original loop. Our approach to vectorization differs in two ways. First, operands transferred between vector and scalar units typically travel through memory, requiring load and store operations that compete for resources with existing memory operations. In contrast, a clustered architecture usually employs an operand network with dedicated resources devoted to communication between clusters. Second, our algorithm performs instruction selection while also tracking changes in resource requirements. A vector opcode may have completely different resource requirements than its corresponding scalar opcode. These requirements, and their competition with other instructions, are monitored closely in order to accurately gauge the trade-offs of vectorization.

Our selective vectorization methodology implements the partitioning heuristic developed by Kernighan and Lin [16]. We believe the algorithm provides an intuitive match for our problem. It is possible that other partitioning heuristics are also suitable. However, since our problem involves exactly two partitions, more general partitioning heuristics may be less applicable. Regardless of the algorithm used, we argue that resource utilization must be tracked carefully when considering partitioning alternatives. Software pipelining facilitates this by allowing us to neglect operation latency during vectorization.

6. Future Work

The algorithm presented in this paper conceptually unrolls loops by a factor equal to the vector length. However, it may be feasible to consider larger scheduling windows. For example, assuming a vector length of two, we could vectorize across iterations 3i and 3i+1 and leave the operations in iterations 3i+2 in scalar form. Here, we assign whole iterations to one set of resources. In the absence of loop-carried dependences, this approach requires no communication between scalar and vector operations. The drawback is that

alignment is adversely affected since aligned memory operations are only possible when the unroll factor is a multiple of the vector length.

Another potential improvement relates to register pressure. Most contemporary multimedia designs separate the vector and scalar register files. For such architectures, selective vectorization can reduce spilling by using both sets of registers. More effective partitioning might be possible if the algorithm can determine which decisions lead to fewer spills.

Currently, our infrastructure only targets well-behaved do loops. Further performance improvements are possible if the scope is extended to handle while loops and loops with early exits. Techniques for modulo scheduling such loops already exist [32]. Much of the complexity lies in ensuring the software pipeline is drained properly upon exit. The task is more difficult in the presence of vector operations since some elements of a vector operation should not execute. Draining a software pipeline with vector instructions could be accomplished if the architecture supports vector masks. In the absence of special hardware support, it might be possible to execute vector instructions normally, as long as we ensure that vector stores only write intended memory locations.

Finally, this work would readily benefit from any loop transformations that expose data parallelism, in particular loop interchange and reduction recognition [6]. The former can create unit-stride memory references in the inner loop, and the latter allows for the vectorization of reductions.

7. Conclusion

Short vector extensions have been integrated into the ISAs of many general-purpose and embedded microprocessors, adding a data parallel component to ILP designs. In this paper, we propose a new methodology for exploiting vector parallelism in software pipelined loops. Our approach judiciously vectorizes operations in important program loops to improve overall resource utilization, allowing for software pipelines with shorter initiation intervals. To our knowledge, this is the first paper to show that a union of ILP and DLP can lead to improved performance. Selective vectorization is applied in the backend, where the impact of vectorization is measured with respect to actual machine resources. This allows the algorithm to accurately account for any communication of operands between scalar and vector operations. Finally, our approach provides a natural mechanism for managing the alignment restrictions enforced by modern multimedia architectures.

We evaluate our methodology using nine SPEC FP benchmarks. Compared to software pipelining, our selective vectorization approach achieves a maximum speedup of 1.38×, with an average of 1.11×.

Acknowledgements

We gratefully acknowledge Charles Leiserson for bringing the work of Kernighan and Lin to our attention. We acknowledge Krste Asanović, Alexandre Eichenberger, William Thies, and Peng Wu for many helpful discussions. We also thank the anonymous referees for their valuable suggestions on an earlier version of this paper. This research was supported in part by NSF award EIA-0071841, and DARPA contracts PCA-F29601-03-2-0065 and HPCA/PERCS-W0133890.

References

[1] Codeplay VectorC. http://www.codeplay.com.
[2] IBM XL C/C++ and Fortran compilers. http://www-306.ibm.com/software/awdtools/xlcpp/.
[3] Trimaran Research Infrastructure. http://www.trimaran.org.
[4] VAST-C/AltiVec. http://www.crescentbaysoftware.com.
[5] A. Aletà, J. M. Codina, J. Sánchez, and A. González. Graph-Partitioning Based Instruction Scheduling for Clustered Processors. In Proceedings of the 34th Annual International Symposium on Microarchitecture, pages 150-159, Austin, TX, December 2001.
[6] R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann, San Francisco, California, 2001.
[7] A. J. Bik. The Software Vectorization Handbook: Applying Multimedia Extensions for Maximum Performance. Intel Press, Hillsboro, OR, 2004.
[8] J. M. Codina, J. Sánchez, and A. González. A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors. In Proceedings of the 10th International Conference on Parallel Architectures and Compilation Techniques, pages 175-184, Barcelona, Spain, September 2001.
[9] A. Darte. On the Complexity of Loop Fusion. In Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, pages 149-157, Newport Beach, CA, October 1999.
[10] D. J. DeVries. A Vectorizing SUIF Compiler: Implementation and Performance. Master's thesis, University of Toronto, June 1997.
[11] K. Diefendorff, P. K. Dubey, R. Hochsprung, and H. Scales. AltiVec Extension to PowerPC Accelerates Media Processing. IEEE Micro, 20(2):85-95, March 2000.
[12] A. E. Eichenberger and E. S. Davidson. Stage Scheduling: A Technique to Reduce the Register Requirements of a Modulo Schedule. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 180-191, Ann Arbor, MI, November 1995.
[13] A. E. Eichenberger, P. Wu, and K. O'Brien. Vectorization for SIMD Architectures with Alignment Constraints. In Proceedings of the SIGPLAN '04 Conference on Programming Language Design and Implementation, pages 82-93, Washington, DC, June 2004.
[14] J. Fridman and Z. Greenfield. The TigerSHARC DSP Architecture. IEEE Micro, 20(1):66-76, January 2000.

[15] R. A. Huff. Lifetime-Sensitive Modulo Scheduling. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 258-267, Albuquerque, NM, June 1993.
[16] B. Kernighan and S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. Bell System Technical Journal, 49:291-307, February 1970.
[17] A. Krall and S. Lelait. Compilation Techniques for Multimedia Processors. International Journal of Parallel Programming, 28(4):347-361, August 2000.
[18] A. Kudriavtsev and P. Kogge. Generation of Permutations for SIMD Processors. In Proceedings of the 2005 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pages 147-156, Chicago, IL, June 2005.
[19] M. Lam. Software Pipelining: An Effective Scheduling Technique for VLIW Machines. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 318-328, Atlanta, GA, June 1988.
[20] S. Larsen and S. Amarasinghe. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the SIGPLAN '00 Conference on Programming Language Design and Implementation, pages 145-156, Vancouver, BC, June 2000.
[21] S. Larsen, E. Witchel, and S. Amarasinghe. Increasing and Detecting Memory Address Congruence. In Proceedings of the 11th International Conference on Parallel Architectures and Compilation Techniques, pages 18-29, Charlottesville, VA, September 2002.
[22] D. M. Lavery and W.-m. W. Hwu. Modulo Scheduling of Loops in Control-Intensive Non-Numeric Programs. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 126-137, Paris, France, December 1996.
[23] R. Lee. Subword Parallelism with MAX-2. IEEE Micro, 16(4):51-59, August 1996.
[24] C. McNairy and D. Soltis. Itanium 2 Processor Microarchitecture. IEEE Micro, 23(2):44-55, March 2003.
[25] D. Naishlos. Autovectorization in GCC. In Proceedings of the 2004 GCC Developers Summit, pages 105-118, 2004.
[26] D. Naishlos, M. Biberstein, S. Ben-David, and A. Zaks. Vectorizing for a SIMdD DSP Architecture. In Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, pages 2-11, San Jose, CA, October 2003.
[27] E. Nystrom and A. E. Eichenberger. Effective Cluster Assignment for Modulo Scheduling. In Proceedings of the 31st Annual International Symposium on Microarchitecture, pages 103-114, Dallas, TX, December 1998.
[28] I. Pryanishnikov, A. Krall, and N. Horspool. Pointer Alignment Analysis for Processors with SIMD Instructions. In Proceedings of the 5th Workshop on Media and Streaming Processors, pages 50-57, San Diego, CA, December 2003.
[29] S. K. Raman, V. Pentkovski, and J. Keshava. Implementing Streaming SIMD Extensions on the Pentium III Processor. IEEE Micro, 20(4):47-57, July 2000.
[30] B. Rau, M. Lee, P. Tirumalai, and M. Schlansker. Register Allocation for Software Pipelined Loops. In Proceedings of the SIGPLAN '92 Conference on Programming Language Design and Implementation, pages 283-299, San Francisco, CA, June 1992.
[31] B. R. Rau. Iterative Modulo Scheduling. Technical Report HPL-94-115, Hewlett Packard Company, November 1995.
[32] B. R. Rau, M. S. Schlansker, and P. Tirumalai. Code Generation Schema for Modulo Scheduled Loops. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 158-169, Portland, OR, December 1992.
[33] J. Shin, M. Hall, and J. Chame. Superword-Level Parallelism in the Presence of Control Flow. In Proceedings of the International Symposium on Code Generation and Optimization, pages 165-175, San Jose, CA, March 2005.
[34] J. Shin, J. Chame, and M. Hall. Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures. In Proceedings of the 11th International Conference on Parallel Architectures and Compilation Techniques, pages 45-55, Charlottesville, VA, September 2002.
[35] N. Sreraman and R. Govindarajan. A Vectorizing Compiler for Multimedia Extensions. International Journal of Parallel Programming, 28(4):363-400, August 2000.
[36] R. E. Tarjan. Depth First Search and Linear Graph Algorithms. SIAM Journal of Computing, 1(2):146-160, June 1972.
[37] M. Tremblay, M. O'Connor, V. Narayanan, and L. He. VIS Speeds New Media Processing. IEEE Micro, 16(4):10-20, August 1996.
[38] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K. Tjiang, S.-W. Liao, C.-W. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy. SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers. ACM SIGPLAN Notices, 29(12):31-37, December 1994.
[39] M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, Redwood City, California, 1996.
[40] P. Wu, A. E. Eichenberger, and A. Wang. Efficient SIMD Code Generation for Runtime Alignment and Length Conversion. In Proceedings of the International Symposium on Code Generation and Optimization, pages 153-164, San Jose, CA, March 2005.
[41] P. Wu, A. E. Eichenberger, A. Wang, and P. Zhao. An Integrated Simdization Framework Using Virtual Vectors. In Proceedings of the 19th ACM International Conference on Supercomputing, pages 169-178, Cambridge, MA, June 2005.
[42] J. Zalamea, J. Llosa, E. Ayguadé, and M. Valero. Modulo Scheduling with Integrated Register Spilling for Clustered VLIW Architectures. In Proceedings of the 34th Annual International Symposium on Microarchitecture, pages 160-169, Austin, TX, December 2001.
[43] Y. Zhao and K. Kennedy. Scalarization on Short Vector Machines. In IEEE International Symposium on Performance Analysis of Systems and Software, pages 187-196, Austin, TX, March 2005.
