US007873812B1

(12) United States Patent
     Mimar
(10) Patent No.: US 7,873,812 B1
(45) Date of Patent: Jan. 18, 2011

(54) METHOD AND SYSTEM FOR EFFICIENT MATRIX MULTIPLICATION IN A SIMD PROCESSOR ARCHITECTURE

(75) Inventor: Tibet Mimar, 1040 Gloucester Ct., Sunnyvale, CA (US) 94087
(73) Assignee: Tibet Mimar, Morgan Hill, CA (US)
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 782 days.
(21) Appl. No.: 10/819,059
(22) Filed: Apr. 5, 2004
(51) Int. Cl.: G06F 15/76 (2006.01)
(52) U.S. Cl.: 712/22; 712/2; 712/7
(58) Field of Classification Search: 712/22, 712/2, 4, 7, 9. See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
4,712,175 A * 12/1987 Torii et al. ...... 712/35
4,862,407 A 8/1989 Fette et al.
5,511,210 A 4/1996 Nishikawa et al.
5,513,366 A * 4/1996 Agarwal et al. ...... 712/22
5,555,428 A 9/1996 Radigan et al.
5,802,384 A 9/1998 Nakamura
5,832,290 A 11/1998 Gostin et al.
5,838,984 A 11/1998 Nguyen et al.
5,864,703 A 1/1999 van Hook et al.
5,872,987 A 2/1999 Wade et al.
5,887,183 A * 3/1999 Agarwal et al. ...... 712/2
5,903,769 A 5/1999 Arya
5,940,625 A 8/1999 Smith
5,973,705 A 10/1999 Narayanaswami
5,991,531 A * 11/1999 Song et al. ...... 703/26
5,991,865 A 11/1999 Longhenry et al.
5,996,057 A 11/1999 Scales et al.
6,058,465 A 5/2000 Nguyen
6,530,015 B1 3/2003 Wilson
6,665,790 B1 12/2003 Glossner, III et al.
6,959,378 B2 10/2005 Nickolls et al.
6,963,341 B1 11/2005 Mimar
7,159,100 B2 1/2007 Van Hook et al.
7,191,317 B1 3/2007 Wilson
7,308,559 B2 12/2007 Glossner et al.
7,376,812 B1 5/2008 Sanghavi et al.
7,467,288 B2 12/2008 Glossner et al.
7,546,443 B2 6/2009 Van Hook et al.
2002/0198911 A1 * 12/2002 Blomgren et al. ...... 708/232
2003/0014457 A1 1/2003 Desai et al.
2005/0071411 A1 * 3/2005 Gustavson et al. ...... 708/514

OTHER PUBLICATIONS
AltiVec Technology Programming Environment Manual, ALTIVECPEM/D, Rev. 0, Nov. 1998. See Permute instruction.
(Continued)

Primary Examiner: Alford W. Kindred
Assistant Examiner: Jesse R. Moll

(57) ABSTRACT

The new system provides for efficient implementation of matrix multiplication in a SIMD processor. The new system provides the ability to map any element of a source vector register to be paired with any element of a second source vector register for vector operations, and specifically for vector-multiply and vector-multiply-accumulate operations, to implement a variety of matrix multiplications without additional permute or data re-ordering instructions. Operations such as DCT and color-space transformations for video processing could be implemented very efficiently using this system.

29 Claims, 15 Drawing Sheets


OTHER PUBLICATIONS (Continued)

Anido et al., "A Novel Method for Improving the Operation Autonomy of SIMD Processing Elements", IEEE, 2002.
Imaging and Compression Engine (ICE), Silicon Graphics, http://futuretech.mirror.vuurwerk.net/ice.html.
"The Complete Guide to MMX Technology", Intel, McGraw-Hill, Inc., ISBN 0-07-006192-0, 1997, pp. 2-21.
AltiVec Instruction Cross Reference: http://developer.apple.com/hardware/ve/instruction_crossref.html.
AltiVec Technology Programming Interface Manual, Freescale Semiconductor, Rev. 0, Jun. 1999.
Intel SSE Instruction Set: http://www.tommesani.com/SSEPrimer.html.
Ian Mapleson, "O2 Architecture", Dec. 4, 2007.
"Intel 64 and IA-32 Architectures Software Developer's Manual", vol. 2B: Instruction Set Reference, N-Z, Dec. 2009, pp. 4-236 through 4-250.
"White Gaussian Noise Generator (Communication)", Reference Design Library, Xilinx System Generator v6.3 User Guide, 2004, 3 pages.
"PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology Programming Environments Manual", version 2.06, Aug. 22, 2005, pp. 244-249.
Klimasauskas, Casimir C., "Not Knowing Your Random Number Generator Could Be Costly: Random Generators - Why Are They Important", PCAI (PC Artificial Intelligence), vol. 16.3, 2002, pp. 52-58.
Donald E. Knuth, "The Art of Computer Programming", vol. 3: Sorting and Searching, Second Edition, Addison-Wesley, 1998, p. 221.
Paar et al., "A Novel Predication Scheme for a SIMD System-on-Chip", 2002.

* cited by examiner

FIG. 1 (Prior Art): Vector operations on eight 16-bit elements of source vector registers VRs-1 and VRs-2. The upper half shows the standard element-to-element mode, in which element i of VRs-1 and element i of VRs-2 feed OP unit i and the result is written to element i of VRd. The lower half shows the broadcast mode, in which one selected element of VRs-2 operates across all elements of VRs-1.

Figure 1

FIG. 2: The control vector register (200) supplies per-element select controls (SEL-0 through SEL-(N-1)) to two vector element mapping logic blocks (230, 240), which map the elements of the source-1 (210) and source-2 (220) vector registers onto the OP units (251) of the vector computational unit (250). Results are written to or accumulated in the high/middle/low vector accumulator (260), clamped by the saturation logic (270), and stored into the destination vector register (280).

Figure 2

FIG. 3: Detail of one vector element mapping logic block. Elements 0 through N-1 of a source (#1 or #2) vector register feed N select units, SEL-0 through SEL-(N-1). Format logic, driven by the format field of the opcode and by the corresponding element of the control vector register, steers each select unit. The accompanying table, "Operation of Format Logic (shown for the mapping of VRs-2)", lists the select control for the three formats: respective elements (output J selects element J), element "K" broadcast (source #2 only; every output selects element K), and any-element-to-any-element mapping (output J selects the element given by bits 13..8 of VRs-3[J] for the VRs-2 mapping shown).

Figure 3

FIG. 4: Per-element datapath with condition and mask control. The paired input source operands (the mapped source-1 and source-2 vector register values for the element "J" position) feed the ALU and multiplier OP unit (251). Writing of the element-J result to the vector accumulator and destination is disabled when the mask bit is set (Disable = ~Mask) or when the per-element condition, selected by the opcode extension from the vector condition flag register, is false. The SIMD opcode word contains the fields OPCODE (6 bits), Source-1 (5 bits), Source-2 (5 bits), Source-3, Format, and Condition.

Figure 4

FIG. 5: Programming model of the vector registers in the preferred embodiment: a 256-bit vector register file VR0-VR31, an alternate vector register file VRA0-VRA31, a vector accumulator organized as high, middle, and low parts of 16 bits per element, and a vector condition flag (VCF) register with 16 flags per element.

Figure 5
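The register state of FIG. 5 can be summarized in a short C sketch. This is only an illustrative software model, not part of the patent; the struct and field names are mine, and sixteen 16-bit elements per 256-bit register are assumed.

    #include <stdint.h>

    #define NUM_ELEMENTS 16                 /* 256-bit register / 16-bit elements         */

    typedef struct {
        int16_t  vr[32][NUM_ELEMENTS];      /* VR0-VR31: vector register file             */
        int16_t  vra[32][NUM_ELEMENTS];     /* VRA0-VRA31: alternate vector register file */
        int16_t  acc_high[NUM_ELEMENTS];    /* vector accumulator, high 16 bits/element   */
        int16_t  acc_mid[NUM_ELEMENTS];     /* vector accumulator, middle 16 bits/element */
        int16_t  acc_low[NUM_ELEMENTS];     /* vector accumulator, low 16 bits/element    */
        uint16_t vcf[NUM_ELEMENTS];         /* VCF: 16 condition flags per element        */
    } simd_state_t;

In this model the three 16-bit accumulator parts together provide 48 bits of precision per element.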

VMUL  Vector Multiply (SIMD Processor)

OPCODE (6 bits) | Source-1 (5 bits) | Source-2 (5 bits) | Source-3 | Format | Condition

Format:
  VMUL [.CC] VRd, VRs-1, VRs-2
  VMUL [.CC] VRd, VRs-1, VRs-2 [element]
  VMUL [.CC] VRd, VRs-1, VRs-2, VRs-3

Description: Multiplies two vector registers and stores the result in the vector accumulator registers; the saturated result is also stored in a destination vector register.

Operation: N is the number of processing elements in the SIMD processor.

for (i = 0; i < N; i++) {
  Case (Format) {
    Default_Mapping:                  // Respective element-to-element mode
      Map_Source_1 = i; Map_Source_2 = i; Mask = 0;
    One_Element_Broadcast:
      Map_Source_1 = i; Map_Source_2 = VRs-3 opcode field; Mask = 0;
    Any_Element_To_Any_Element_Mapping:
      Map_Source_1 = VRs-3[i][5:0]; Map_Source_2 = VRs-3[i][13:8]; Mask = VRs-3[i][15];
  }
  if (~Mask && VCF[i][Condition]) {
    Paired_Source_1 = VRs-3[i][6]  ? 0 : VRs-1[Map_Source_1];
    Paired_Source_2 = VRs-3[i][14] ? 0 : VRs-2[Map_Source_2];
    VACC[i] <- (Paired_Source_1 * Paired_Source_2)[31:0];
    VRd[i]  <- Signed-Clamp(VACC[i]);
  }
}

Figure 6
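A small C sketch of the VMUL behavior above may make the any-element mapping easier to follow. It is a behavioral reference model only, not the hardware: condition flags are omitted, the 48-bit accumulator is narrowed to 32 bits per element, and names such as vmul_model are mine. The same routine models VMAC if the marked line adds to vacc[i] instead of overwriting it.

    #include <stdint.h>

    #define N 16                             /* number of SIMD elements (16-wide example) */

    enum vmul_format { DEFAULT_MAPPING, ONE_ELEMENT_BROADCAST, ANY_TO_ANY };

    static int16_t signed_clamp16(int32_t v)
    {
        if (v >  32767) return  32767;
        if (v < -32768) return -32768;
        return (int16_t)v;
    }

    /* One VMUL: pair mapped elements of vrs1 and vrs2, write the accumulator and
       the clamped destination.  Element selects in vrs3 are assumed to be < N.   */
    void vmul_model(enum vmul_format fmt, int broadcast_elem,
                    const int16_t vrs1[N], const int16_t vrs2[N],
                    const uint16_t vrs3[N],      /* control vector, ANY_TO_ANY format   */
                    int32_t vacc[N], int16_t vrd[N])
    {
        for (int i = 0; i < N; i++) {
            int map1 = i, map2 = i, mask = 0, zero1 = 0, zero2 = 0;
            if (fmt == ONE_ELEMENT_BROADCAST) {
                map2 = broadcast_elem;           /* element "K" from the opcode field   */
            } else if (fmt == ANY_TO_ANY) {
                map1  =  vrs3[i]        & 0x3F;  /* bits 5..0 : source-1 element select */
                zero1 = (vrs3[i] >> 6)  & 1;     /* bit 6     : force zero for source-1 */
                map2  = (vrs3[i] >> 8)  & 0x3F;  /* bits 13..8: source-2 element select */
                zero2 = (vrs3[i] >> 14) & 1;     /* bit 14    : force zero for source-2 */
                mask  = (vrs3[i] >> 15) & 1;     /* bit 15    : disable this element    */
            }
            if (!mask) {                         /* condition flags omitted here        */
                int32_t a = zero1 ? 0 : vrs1[map1];
                int32_t b = zero2 ? 0 : vrs2[map2];
                vacc[i] = a * b;                 /* a VMAC model would add to vacc[i]   */
                vrd[i]  = signed_clamp16(vacc[i]);
            }
        }
    }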

VMAC  Vector Multiply-Accumulate (SIMD Processor)

OPCODE (6 bits) | Source-1 (5 bits) | Source-2 (5 bits) | Source-3 | Format | Condition

Format:
  VMAC [.CC] VRd, VRs-1, VRs-2
  VMAC [.CC] VRd, VRs-1, VRs-2 [element]
  VMAC [.CC] VRd, VRs-1, VRs-2, VRs-3

Description: Multiplies two vector registers and accumulates the sign-extended result in the vector accumulator registers; the saturated result is also stored in a destination vector register.

Operation: N is the number of processing elements in the SIMD processor.

for (i = 0; i < N; i++) {
  Case (Format) {
    Default_Mapping:                  // Respective element-to-element mode
      Map_Source_1 = i; Map_Source_2 = i; Mask = 0;
    One_Element_Broadcast:
      Map_Source_1 = i; Map_Source_2 = VRs-3 opcode field; Mask = 0;
    Any_Element_To_Any_Element_Mapping:
      Map_Source_1 = VRs-3[i][5:0]; Map_Source_2 = VRs-3[i][13:8]; Mask = VRs-3[i][15];
  }
  if (~Mask && VCF[i][Condition]) {
    Paired_Source_1 = VRs-3[i][6]  ? 0 : VRs-1[Map_Source_1];
    Paired_Source_2 = VRs-3[i][14] ? 0 : VRs-2[Map_Source_2];
    VACC[i] <- VACC[i] + (Paired_Source_1 * Paired_Source_2)[31:0];
    VRd[i]  <- Signed-Clamp(VACC[i]);
  }
}

Figure 7
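Both instructions read their per-element mapping from VRs-3 in the any-element-to-any-element format. A hypothetical helper for packing one 16-bit control element, using the bit positions listed in the operations above (the function name is mine, not a patent-defined interface):

    #include <stdint.h>

    /* Pack one 16-bit VRs-3 control element:
         bits 5..0  : element selected from VRs-1
         bit  6     : substitute zero for the VRs-1 operand
         bits 13..8 : element selected from VRs-2
         bit  14    : substitute zero for the VRs-2 operand
         bit  15    : mask - disable writing this output element                  */
    static inline uint16_t pack_control(unsigned sel1, unsigned zero1,
                                        unsigned sel2, unsigned zero2, unsigned mask)
    {
        return (uint16_t)((sel1 & 0x3F)
                        | ((zero1 & 1u) << 6)
                        | ((sel2 & 0x3F) << 8)
                        | ((zero2 & 1u) << 14)
                        | ((mask  & 1u) << 15));
    }

For example, pack_control(4, 0, 0, 0, 0) makes an output element take element 4 of VRs-1 paired with element 0 of VRs-2.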

FIG. 8: Color space conversion as the product of a constant 4x4 matrix with a 4x1 input color vector:

  | Q0 |   | M0  M4  M8   M12 |   | I0 |
  | Q1 | = | M1  M5  M9   M13 | x | I1 |
  | Q2 |   | M2  M6  M10  M14 |   | I2 |
  | Q3 |   | M3  M7  M11  M15 |   | I3 |

Figure 8

Output | Step #1  | Step #2      | Step #3       | Step #4
Q0     | M0 * I0  | + (M4 * I1)  | + (M8 * I2)   | + (M12 * I3)
Q1     | M1 * I0  | + (M5 * I1)  | + (M9 * I2)   | + (M13 * I3)
Q2     | M2 * I0  | + (M6 * I1)  | + (M10 * I2)  | + (M14 * I3)
Q3     | M3 * I0  | + (M7 * I1)  | + (M11 * I2)  | + (M15 * I3)

Table: Multiplication of a 4x4 matrix with a 4x1 matrix.

Figure 9

Format (From Opcode)                  | Instruction Syntax                        | Mode
Respective Elements                   | <Vector Inst> VRd, VRs-1, VRs-2           | Standard
Respective Elements                   | <Vector Inst> VRd, VRs-1, VRAs-2          | Standard w/ Alternate
Element "K" Broadcast (Source-2 Only) | <Vector Inst> VRd, VRs-1, VRs-2[element]  | Broadcast Mode
Any-Element to Any-Element Mapping    | <Vector Inst> VRd, VRs-1, VRs-2, VRs-3    | Full Mapping

Table: Vector Instruction Format Groups.

Figure 13

FIG. 14: Alternate embodiment (1400) without a separate vector accumulator. As in FIG. 2, a control vector register provides per-element select controls (SEL-0 through SEL-(N-1)) to two vector element mapping logic blocks (1430, 1440) that pair elements of the source-1 and source-2 vector registers for the OP units of the vector computational unit (1450). A vector adder with pass-through (1460) combines each result with the destination vector register, which thereby also acts as the vector accumulator.

Figure 14

FIG. 15: High-level view of the embodiment shown in FIG. 2. A multi-ported vector register file VR0-VR31 (1500), with read and write ports to the memory interface (1514, 1515), feeds the vector element mapping logic (230) and vector mask; the vector operation unit (ALU and multipliers, 250) drives the vector accumulator, whose clamped output (270) is written back to the register file.

Figure 15

METHOD AND SYSTEM FOR EFFICIENT MATRIX MULTIPLICATION IN A SIMD PROCESSOR ARCHITECTURE

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is related to co-pending U.S. patent application Ser. No. 10/357,805, filed on Feb. 3, 2003 by Tibet Mimar, entitled "Flexible Vector Modes of Operation", and to co-pending U.S. patent application Ser. No. 10/441,336, filed on May 20, 2003 by Tibet Mimar, entitled "Method For Efficient Handling Of Vector High-Level Language Conditional Constructs In A SIMD Processor".

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention relates generally to the field of processor chips and specifically to the field of Single-Instruction Multiple-Data (SIMD) processors. More particularly, the present invention relates to matrix operations in a SIMD processing system.

2. Background

Matrix multiplication operations are used often in many digital signal-processing operations, because linear operations can be expressed as matrix operations. For example, a 4x4 transformation matrix is used to perform a transformation from any color space to another color space. The four color components could be Red, Green, Blue, and Alpha, or any three color component types plus an Alpha channel typically used for keying applications. Color space transformation is used often in video, image, and graphics applications. Another application of matrix multiplication is the implementation of the Discrete Cosine Transform (DCT) for the H.264 video compression standard (also known as AVC and MPEG-4 Part 10), which requires multiplication of multiple 4x4 matrices.

Existing SIMD processor architectures try to use SIMD instructions to speed up the implementation of these matrix operations by parallelizing multiple operations. SIMD instructions work on vectors consisting of multiple elements at the same time. An 8-wide SIMD could operate on eight 16-bit elements concurrently, for example. Even though SIMD architectures and instructions can be very efficient in implementing Finite Impulse Response (FIR) filters and other operations, the implementation of matrix operations is not efficient, because matrix operations require cross mapping of vector elements for arithmetic operations. SIMD processors on the market at the time of this writing work with intra vector elements, which means a one-to-one fixed mapping of vector elements is used for vector operations. For example, a vector-add instruction adds element-0 of the source-1 vector register to element-0 of the source-2 vector register and stores the result in element-0 of the output vector register; adds element-1 of source-1 to element-1 of source-2 and stores the result in element-1 of the output vector register; and so forth. In contrast, inter-mapping of source vector elements refers to any cross mapping of vector elements. Most SIMD processors provide little or no support for inter-element arithmetic operations. Motorola's AltiVec provides only two inter-element vector instructions for arithmetic operations: sum of products and sum across (vector sum of elements). Other SIMD processors, such as MMX/SSE from Intel or ICE from SGI, do not provide support for inter-element operations other than vector sum operations. Silicon Graphics' ICE provides a broadcast mode, whereby one selected element of a source vector can operate across all elements of a second vector register. Intra-element operation and broadcast modes of SIMD operation of the prior art are shown in FIG. 1.

Some processors, like the AltiVec, provide non-arithmetic vector permute instructions, whereby vector elements can be permuted prior to arithmetic operations. However, this requires two vector-permute instructions followed by the vector arithmetic instruction, which reduces the efficiency significantly for core operations such as DCT or color space conversion. It is conceivable that a SIMD processor could have multiple vector units, where two vector permute units and a vector arithmetic unit operate concurrently in a superscalar processor executing two or more vector instructions in a pipelined fashion. However, this requires the results of one instruction to be immediately available to the other unit by bypassing intermediate vector results before they are written to the vector register file. Such bypassing of intermediate results becomes increasingly costly in terms of die area or number of gates as the number of SIMD elements is increased beyond eight elements or 128-bit wide data paths.

The following describes how the above two example applications are handled by the prior art.

Color Space Conversion

A transformation between color spaces requires multiplication of a 4x4 matrix with a 4x1 input color matrix, converting the input vector {I0, I1, I2, I3} to the output vector {Q3, Q2, Q1, Q0}, as shown in FIG. 8. The values of the 4x4 matrix, M0 to M15, depend on the color space conversion being performed and consist of constant values. First, let us assume we have a 4-wide SIMD processor and we are trying to calculate this matrix product. We could pre-arrange the constant values of the 4x4 matrix so that the values are in column-sequential order, as we have done in this example. One could implement the matrix operations in two different ways. The classic approach is to multiply the first row of the matrix with the first column of the second matrix (only one column in this case) and vector-sum the results to obtain the first output value. Similarly, the second value is obtained by multiplying the second row of the 4x4 matrix with the input vector and vector-summing the intermediate results. The second approach, which is more appropriate for a SIMD implementation, is to calculate the partial sums of all four output values at the same time. In other words, we would first multiply I0 with all the values of the first column of the 4x4 matrix and store this vector of four values in a vector accumulator register. Next, we would multiply I1 with all the second-column values of the 4x4 matrix and add this vector of four values to the vector accumulator values, and so forth for the remaining two columns of the 4x4 matrix. This partial-product approach minimizes inter-element operations and avoids the use of vector-sum operations. Column-sequential ordering of the 4x4 matrix allows us to load vectors of {m0, m1, m2, m3}, {m4, m5, m6, m7}, and so on, as vector loads of sequential data elements into a vector register without the need to re-shuffle them. The arithmetic operations to be performed are shown in the table of FIG. 9.
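As an illustrative sketch of this partial-product approach (plain C rather than SIMD intrinsics, with function and variable names of my own choosing), each iteration of the outer loop below corresponds to one 4-wide broadcast multiply or multiply-accumulate step of FIG. 9:

    #include <stdint.h>

    /* Partial-product color conversion: m[] holds the 4x4 matrix in column-sequential
       order ({m0..m3} is column 0, and so on), in[] is {I0, I1, I2, I3}, and out[]
       receives {Q0, Q1, Q2, Q3}.                                                      */
    void color_convert_4x1(const int16_t m[16], const int16_t in[4], int32_t out[4])
    {
        for (int row = 0; row < 4; row++)
            out[row] = 0;                          /* clear the vector accumulator      */
        for (int col = 0; col < 4; col++)          /* steps 1..4 of the table in FIG. 9 */
            for (int row = 0; row < 4; row++)      /* one 4-wide vector operation       */
                out[row] += (int32_t)m[4 * col + row] * in[col];   /* in[col] broadcast */
    }

Because the matrix is stored column-sequentially, each step reads m[4*col .. 4*col+3] as one contiguous vector load.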
A 4-wide SIMD could calculate such matrix operations very efficiently, assuming there is a broadcast mode where one element can operate across all elements of a second vector. A somewhat less efficient approach used by SIMD processors is to first use a "splat" instruction, which copies any element from one vector register into all elements of another register. This reduces the performance by 50 percent in comparison to a SIMD with broadcast mode, since every multiply or multiply-accumulate operation has to be preceded by a splat vector instruction. MMX from Intel does not have such an instruction, and therefore it takes four instructions, whereas SSE from Intel has a packed-shuffle instruction that can implement the splat operation. In summary, aside from the 50 percent performance reduction due to the splat operation, 4-wide SIMD processors can perform matrix operations about twice as fast as their scalar counterparts. But the performance of these processors does not come close to meeting the high performance requirements of video, graphics, and other digital signal processing applications, because their level of parallelism is small due to the small number of parallel processing elements. Furthermore, AltiVec and Pentium-class processors do not meet the low-cost and low-power requirements of embedded consumer applications.

Let us look at an 8-wide SIMD to exemplify what happens when a SIMD processor has higher potential "data crunching" power. In this case, we could process two input pixels at the same time, i.e., we would load eight values {I0, I1, I2, I3, I4, I5, I6, I7}, where I0-I3 are the color components of one pixel and I4-I7 are the color components of the adjacent pixel. In this case, we could preconfigure the 4x4 matrix columns as {m0, m1, m2, m3, m0, m1, m2, m3} and so forth, and prestore these in memory so that we could access any column as a vector of eight values by a single vector load instruction. This is equivalent to multiplying the same 4x4 matrix with an input matrix of 4x2, resulting in a 4x2 matrix. FIG. 10 illustrates the operations in four steps of partial-product accumulation. The problem, however, is how we form a vector of {I0, I0, I0, I0, I4, I4, I4, I4} from the input vector {I0, I1, I2, I3, I4, I5, I6, I7}. Intel's MMX and SSE do not offer any direct help to resolve this, not to mention that they are also only 4-wide SIMD. Motorola's AltiVec vector permute instruction could be used, but again this reduces performance by 50 percent, because during the permute step the arithmetic unit remains idle, or we face complicated and costly bypassing of intermediate variables between concurrent vector computational units.
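The desired pairing can be expressed as a per-lane source index on the input vector, which is the kind of mapping the control vector register of the present invention supplies. A minimal C sketch of one accumulation step, with names and layout of my own choosing for illustration:

    #include <stdint.h>

    /* One accumulation step of the two-pixel (4x2) case on an 8-wide machine.
       m_col[] holds one matrix column stored twice ({m, m}); in[] holds the two
       input pixels {I0..I3, I4..I7}; src_index gives, per lane, which input
       element to pair with the matrix value instead of using a permute.         */
    void mac_step_8wide(const int16_t m_col[8], const int16_t in[8],
                        int col, int32_t acc[8])
    {
        static const int src_index[4][8] = {
            {0, 0, 0, 0, 4, 4, 4, 4},              /* step 1: broadcast I0 and I4 */
            {1, 1, 1, 1, 5, 5, 5, 5},              /* step 2: broadcast I1 and I5 */
            {2, 2, 2, 2, 6, 6, 6, 6},              /* step 3: broadcast I2 and I6 */
            {3, 3, 3, 3, 7, 7, 7, 7},              /* step 4: broadcast I3 and I7 */
        };
        for (int lane = 0; lane < 8; lane++)
            acc[lane] += (int32_t)m_col[lane] * in[src_index[col][lane]];
    }

Iterating col from 0 to 3 accumulates {Q0..Q3} for both pixels in lanes 0-3 and 4-7 without any separate permute instruction.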
As shown in FIG. 11, multiplying matrix A 1110 and matrix B 1120 may be calculated by first decomposing matrix A into its columns and matrix B into its rows, then performing a series of matrix multiplications of the columns of A with the respective rows of B and summing the partial results. The columns of matrix A 1110 are shown as A1 1130, A2 1131, and An 1132. Similarly, the rows of B 1120 are decomposed as B1 1160, B2 1161, and Bn 1162. Then, the matrix multiplication of the A and B matrices can be expressed as the summation of A1·B1 1170, A2·B2 1171, through An·Bn 1172, where the symbol "·" denotes matrix multiplication. Matrix multiplication of a column matrix Ax with a row matrix Bx, where "x" denotes the column or row number before decomposition, generates an m-by-p matrix 1180. For the above example of a 4-by-4 A matrix 1110 and a 4-by-2 B matrix, the output matrix has 8 elements. Calculating 8 of these in parallel would require an 8-wide SIMD, which could iterate over the four columns to calculate the result in four clock cycles. However, such a calculation is not possible with that efficiency, because pairing of the input vectors requires the following mapping of input vectors: {1, 2, ..., 4, 1, 2, ..., 4} and {1, 1, ..., 1, 2, 2, ..., 2}. The requirement for mapping is further compounded for wider SIMD architectures and for larger input matrices.

DCT Calculation

The H.264 video compression standard frequently uses a DCT-based transform for 4x4 blocks of residual data. This transformation requires the calculation of the following operation: Y = A·X·B, where A and B are constant 4x4 matrices and X is the input 4x4 data matrix. We could first calculate A·X or X·B, because this is a linear operation. Since the X matrix is the input data, it is typically in row-sequential order. This means the mapping of elements required for multiplication or multiply-accumulation cannot be pre-arranged, as in the previous example, and requires a cross mapping of inter elements. Let us take the case of Q = X·B, and let us assume that all input and output matrices are stored in row-sequential order.

This matrix multiplication can be done in four steps, as illustrated in FIG. 12, using the multiply-accumulate feature of a SIMD processor to accumulate partial results. The first step is a multiply operation of the two vectors
{X0, X0, X0, X0, X4, X4, X4, X4, X8, X8, X8, X8, X12, X12, X12, X12} and
{B0, B1, B2, B3, B0, B1, B2, B3, B0, B1, B2, B3, B0, B1, B2, B3}.
This step would require a single vector-multiply instruction if we had the input vectors mapped as required, but this mapping would require two additional vector-permute instructions. The second step is a vector multiply-accumulate operation of the following two vectors:
{X1, X1, X1, X1, X5, X5, X5, X5, X9, X9, X9, X9, X13, X13, X13, X13} and
{B4, B5, B6, B7, B4, B5, B6, B7, B4, B5, B6, B7, B4, B5, B6, B7}.
This step, too, could be done with a single vector-multiply-accumulate instruction if we had the input matrix elements already mapped as shown. To get this mapping, prior art such as AltiVec from Motorola requires two additional vector-permute instructions.
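In C, the four steps and the per-lane index patterns they require look as follows. This is an illustrative model (names mine) of the data flow of FIG. 12 on a 16-wide machine, with X, B, and Q stored row-sequentially; under the present invention the two index patterns would be supplied by the control vector register rather than by permute instructions.

    #include <stdint.h>

    /* Q = X * B for 4x4 matrices, all stored row-sequentially in x[16], b[16], q[16]. */
    void dct_stage_16wide(const int16_t x[16], const int16_t b[16], int32_t q[16])
    {
        for (int lane = 0; lane < 16; lane++)
            q[lane] = 0;
        for (int step = 0; step < 4; step++) {        /* one multiply, then three MACs      */
            for (int lane = 0; lane < 16; lane++) {
                int x_index = 4 * (lane / 4) + step;  /* step 0: X0,X0,X0,X0,X4,... pattern */
                int b_index = 4 * step + (lane % 4);  /* step 0: B0,B1,B2,B3,B0,... pattern */
                q[lane] += (int32_t)x[x_index] * b[b_index];
            }
        }
    }

Step 0 reads x in the pattern {X0, X0, X0, X0, X4, ...} and b in the pattern {B0, B1, B2, B3, B0, ...}, exactly the two vectors listed above; the remaining three steps are multiply-accumulates.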
The requirement of two additional vector-permute instructions reduces the performance to about one third, because the duty cycle of arithmetic operations, where the vector arithmetic unit is actually used, is one-third. A superscalar implementation would require two concurrent vector permute units in parallel with the vector arithmetic unit, which would also complicate the intermediate bypassing due to the wide data widths, for example 512-bit wide data paths for a 32-wide SIMD. A scalar processor typically bypasses results so that the result of one instruction becomes immediately useable by the following instruction. Even though each instruction takes several cycles to execute in a pipelined fashion, such as the Instruction Read, Instruction-Decode/Register Read, Execute-1, Execute-2, and Write-Back stages of a five-stage pipelined RISC processor, such bypassing becomes costly with regard to gate count in a wide-data-path processor such as a SIMD. That is why SIMD processors and some VLIW processors from Texas Instruments (TMS320C6000 series) do not employ bypassing, and it is left to the user to insert enough instruction padding, by the use of interleaved instructions or NOPs, to take care of this.

SUMMARY OF THE INVENTION

The present invention provides a method by which any element of a source-1 vector register may operate in combination with any element of a source-2 vector register. This provides the ultimate flexibility in pairing vector elements as inputs to each of the arithmetic or logical operation units of a processor, such as a SIMD processor. This flexible inter-element mapping makes it possible to implement matrix multiplications very efficiently. A 4x4 matrix multiplication requires only one vector-multiply and three vector-multiply-accumulate instructions in a 16-wide embodiment. H.264 requires multiplication of two 4x4 matrices for the DCT transform, which could be implemented using a total of only eight vector arithmetic instructions, as opposed to the 24 vector instructions of the prior art. The mapping of the input elements is controlled by a third vector source register, wherein certain bit-fields within each element of the control vector register select the source mapping for both source vectors.
BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and form a part of this specification, illustrate prior art and embodiments of the invention, and together with the description serve to explain the principles of the invention:

Prior Art FIG. 1 illustrates an example of the one-to-one and broadcast modes of vector operations, that is, operations between vector elements as implemented by a prior-art SIMD processor. Both one-to-one operations between corresponding vector elements of the source vector registers, and operations where one element of a source vector register is broadcast to operate in combination across all the elements of the other source vector register, are illustrated in this figure.

FIG. 2 shows the method of the present invention, where the mapping of elements of both source vectors is flexible. Paired inputs to the vector computational unit from the two source inputs, the source-1 vector register and the source-2 vector register, are mapped from any element of each of the two source vectors.

FIG. 3 illustrates the details of the vector mapping logic block that is used twice in FIG. 2. Multiplexers are used to select one of the source vector elements based on a control field for each of the vector elements, provided in a specified vector control register. A zero value can be sourced for any source vector element.

FIG. 4 illustrates per-vector-element condition flag and mask control of SIMD operations, that is, the operation of enable/disable bit control and condition code control of vector operations. The condition flags and mask control are not used for the matrix multiplication, but are, in general, part of the preferred embodiment of the present invention for other SIMD operations.

FIG. 5 shows the programming model of the vector registers for the preferred embodiment.

FIG. 6 shows the details of the Vector-Multiply instruction of the preferred embodiment.

FIG. 7 shows the details of the Vector-Multiply-Accumulate instruction of the preferred embodiment.

FIG. 8 shows the matrix multiplication required for the general case of color space conversion.

FIG. 9 shows the implementation of multiplication of a 4x4 matrix with a 4x1 matrix.

FIG. 10 shows performing two matrix multiplications at the same time in four steps, or multiplication of a 4x4 matrix with a 4x2 matrix.

FIG. 11 shows the decomposition of an m-by-n matrix and an n-by-p matrix into n column matrices and n row matrices, and the matrix multiplication of respective column matrices with row matrices.

FIG. 12 shows the mapping requirements for multiplication of a 4x4 matrix with another 4x4 matrix implemented in a 16-wide SIMD processor.

FIG. 13 shows possible instruction formats of one embodiment of the present invention.

FIG. 14 shows another embodiment of the present invention without a separate vector accumulator, where the destination vector register also acts as the vector accumulator.

FIG. 15 shows a high-level view of the embodiment shown in FIG. 2 with a vector accumulator, where all source and destination vector registers are stored in a vector register file with multiple ports.
Each computational unit (shoWn as “OP”, or OPeration) of the source vector elements based on a control ?eld for each 251 can perform arithmetic or logical operations on said of the vector elements provided in a speci?ed vector control paired source vector elements. The output of vector compu register. Zero value could be sourced for any source vector tational unit is stored or added to respective elements of element. 30 vector accumulator 260, in accordance With a given vector FIG. 4 illustrates per-vector-element Condition ?ag and operation. The contents of vector accumulator elements are Mask Control of SIMD Operations, that is, the operation of clamped by clamp logic 270 and stored into selected destina enable/disable bit control and condition code control of vec tion vector register 280 in one embodiment. As We have seen tor operations. The condition ?ags and mask control are not in the above examples, ef?cient multiplication of matrices used for the matrix multiplication, but is, in general, part of 35 requires mapping of both source vectors. Each vector element the preferred embodiment of present invention for other mapping logic consists of multiple select logic, as shoWn in SIMD operations. FIG. 3. An option to override a Zero value is also provided for FIG. 5 shoWs the programming model of vector registers other vector operations of the preferred embodiment of for the preferred embodiment. present invention, but is not really needed for matrix opera FIG. 6 shoWs the details of Vector-Multiply instruction of 40 tions. Each select unit, SEL-0 through SEL-(N-l) can select the preferred embodiment. any one of the elements of a source vector. One of three FIG. 7 shoWs the details of Vector-Multiply-Accumulate operating modes is selected by the format logic that controls instruction of the preferred embodiment. these select logic units to map elements of source vectors to FIG. 8 shoWs matrix multiplication required for the general be paired in accordance With a instruction format de?ned by case of color space conversion. a vector instruction and also control vector register element FIG. 9 shoWs implementation of multiplication of 4x4 values. Table of FIG. 3 shoWs three different instruction for matrix With a 4x1 matrix. mats: prior art respective elements of tWo source vectors and FIG. 10 shoWs performing tWo matrix multiplications at one element to be broadcast to all element positions (both the same time in four steps or multiplication of 4x4 matrix shoWn in FIG. 1), and the general mapping of any-element of With a 4x2 matrix. ?rst source vector to be paired With any-element of second FIG. 11 shoWs decomposition of m-by-n matrix and n-by-p source vector. The opcode coding of a vector instruction determines Which of the instruction formats is chosen. matrix into n column matrices and n roW matrices, and matrix multiplication of respective column matrices With roW matri The preferred embodiment maps the ?elds of each control vector element as folloWs to control the select logic for each ces. vector element: FIG. 12 shoWs mapping requirements of multiplication of Bits 5-0: Select source element fromVRs-l vector register; 4x4 matrix With another 4><4 matrix implementation in a Bit 6: When set to one, selects Zero for VRs-1 16-Wide SIMD processor. Bit 7: Reserved. FIG. 13 shoWs possible instruction formats of one embodi Bits 13-8: Select source element from VRs-2 vector regis ment of present invention. 60 FIG. 
14 shoWs another embodiment of present invention ter; Bit 14: When set to one, selects Zero for VRs-2 Without a separate vector accumulator, Where destination vec tor register also acts as the vector accumulator. Bit 15: Mask bit, When set to one disables Writing output for that element. FIG. 15 shoWs a high-level vieW of embodiment shoWn in FIG. 2 With a vector accumulator, Where all source and des This mapping supports up to 64 vector elements, i.e., 64-Wide tination vector registers are stored in a vector register ?le With SIMD. If there are feWer elements, then the unused bits could multiple ports. be reserved as Zero for possible future extension. FIG. 4