US 20140176569A1
(19) United States
(12) Patent Application Publication    (10) Pub. No.: US 2014/0176569 A1
     Meixner                           (43) Pub. Date: Jun. 26, 2014

(54) GRAPHICS PROCESSING UNIT EMPLOYING A STANDARD PROCESSING UNIT AND A METHOD OF CONSTRUCTING A GRAPHICS PROCESSING UNIT

(71) Applicant: NVIDIA CORPORATION, Santa Clara, CA (US)

(72) Inventor: Albert Meixner, Mountain View, CA (US)

(73) Assignee: NVIDIA CORPORATION, Santa Clara, CA (US)

(21) Appl. No.: 13/724,233

(22) Filed: Dec. 21, 2012

Publication Classification

(51) Int. Cl.
     G06T 1/00 (2006.01)
     H05K 13/00 (2006.01)
(52) U.S. Cl.
     CPC: G06T 1/20 (2013.01); H05K 13/00 (2013.01)
     USPC: 345/501; 29/592

(57) ABSTRACT

Employing a general processing unit as a programmable function unit of a graphics pipeline and a method of manufacturing a graphics processing unit are disclosed. In one embodiment, the graphics pipeline includes: (1) accelerators, (2) an input output interface coupled to each of the accelerators and (3) a general processing unit coupled to the input output interface and configured as a programmable function unit of the graphics pipeline, the general processing unit configured to issue vector instructions via the input output interface to vector data paths for the programmable function unit.

[Representative figure: block diagram of the programmable graphics pipeline 200, showing fixed function units 231, 232, 233; multi-bank memory 235; vector data paths A 237 and B 239; dedicated accelerator connections 240 (241 through 246); unit connectors 248, 249; IO interface 220; and a CPU 210 with a graphics register file 215 in an L2, coupled via control connection 213 and data connection 217.]

[FIG. 1 (Sheet 1 of 3): block diagram of a computing system 100. A system data bus 132 connects a central processing unit 102, input devices 108, a system memory 104 (application program 112, application programming interface 114, GPU driver 115) and a graphics processing subsystem 106. The subsystem contains a graphics processing unit 116 (accelerators 117, IO interface 118, general processing unit 119) coupled by an on-chip GPU data bus 136 to an on-chip GPU memory 122 (GPU programming code 128, on-chip buffers 130) and by a GPU data bus 134 to a GPU local memory 120 (frame buffer 126); display devices 110 are driven from the frame buffer.]

[FIG. 2 (Sheet 2 of 3): block diagram of an embodiment of the programmable graphics pipeline 200 described below.]

[FIG. 3 (Sheet 3 of 3): flow diagram of a method 300 of manufacturing a graphics pipeline: START (305); COUPLE A FIXED FUNCTION UNIT TO AN INPUT OUTPUT INTERFACE (310); COUPLE A GENERAL PROCESSING UNIT TO THE INPUT OUTPUT INTERFACE (320); PROGRAM THE GENERAL PROCESSING UNIT TO EMULATE A FUNCTION UNIT OF A GRAPHICS PIPELINE (330); CONFIGURE THE MEMORY AS A GRAPHICS REGISTER FOR THE GRAPHICS PIPELINE (340); COUPLE VECTOR DATA PATHS TO THE INPUT OUTPUT INTERFACE, WHEREIN THE VECTOR DATA PATHS ARE ASSOCIATED WITH THE GENERAL PROCESSING UNIT FUNCTION UNIT (350); END (360).]

GRAPHICS PROCESSING UNIT EMPLOYING A STANDARD PROCESSING UNIT AND A METHOD OF CONSTRUCTING A GRAPHICS PROCESSING UNIT

TECHNICAL FIELD

[0001] This application is directed, in general, to graphics processing units (GPUs) and, more specifically, to components of a GPU.

BACKGROUND

[0002] In traditional GPUs, fixed function units are statically connected together to form a fixed function graphics pipeline. The output packets of each of the fixed function units, or fixed function stages, are designed to match the input packets of the downstream fixed function unit. A more flexible approach is to define the graphics pipeline in software as a program or programs running on a programmable processor. In such a pipeline, the functional stages are implemented in software, with data being moved via a regular general purpose memory system.

SUMMARY

[0003] In one aspect, a graphics pipeline is disclosed. In one embodiment, the graphics pipeline includes: (1) accelerators, (2) an input output interface coupled to each of the accelerators and (3) a general processing unit coupled to the input output interface and configured as a programmable function unit of the graphics pipeline, the general processing unit configured to issue vector instructions via the input output interface to vector data paths for the programmable function unit.

[0004] In another aspect, an apparatus is disclosed. In one embodiment, the apparatus includes: (1) a scalar processing core programmed to emulate a function unit of a graphics pipeline and (2) a memory directly coupled to the scalar processing core and including a graphics register configured to store input and output operands for accelerators of the graphics pipeline.

[0005] In yet one other aspect, a method of manufacturing a graphics processing unit is disclosed. In one embodiment, the method includes: (1) coupling a fixed function unit to an input output interface, (2) coupling a general processing unit to the input output interface, (3) programming the general processing unit to emulate a function unit of a graphics pipeline and (4) coupling vector data paths to the input output interface, wherein the vector data paths are associated with the general processing unit function unit.

BRIEF DESCRIPTION

[0006] Reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

[0007] FIG. 1 is a block diagram of an embodiment of a computing system in which one or more aspects of the disclosure may be implemented;

[0008] FIG. 2 illustrates a block diagram of an embodiment of a graphics pipeline constructed according to the principles of the disclosure; and

[0009] FIG. 3 illustrates a flow diagram of a method of manufacturing a graphics pipeline carried out according to the principles of the disclosure.

DETAILED DESCRIPTION

[0010] In pipelines with software-defined functional units, a custom processor is typically used to implement the programmable function stages and is built to process parallel code. Thus, the custom processor requires the necessary modifications, and the associated expense, to achieve a processor specifically configured to process parallel code. While well suited for highly parallel code, the special purpose processor is ill-suited for scalar code.

[0011] As such, disclosed herein is a programmable graphics pipeline that employs a general processing unit to implement function stages of the pipeline. The programmable graphics pipeline advantageously treats the vector data paths of a function stage as accelerators and attaches the vector data paths to an input output interface that provides programmable, non-static connections between stages of the graphics pipeline. The general processing unit is a standard processing unit having a scalar core that is suited for processing scalar code. For example, the general processing unit can be a standard CPU, such as an ARM A9/A15. As disclosed herein, the general processing unit also includes a processor memory.

[0012] An accelerator is a hardware implemented unit that is used to accelerate processing in the graphics pipeline. An example of an accelerator is a fixed function rasterizer or a texture unit. Thus, the disclosure provides a graphics pipeline architecture having a general processing unit that issues vector instructions into an input output interface as regular graphics requests. A request is a work request to be performed by one of the accelerators. The vector instructions, therefore, can be issued via a control bus connecting the general processing unit to the input output interface. In one embodiment, the input output interface is an input output connector as disclosed in patent application Ser. No. 13/722,901, by Albert Meixner, entitled "AN INPUT OUTPUT CONNECTOR FOR ACCESSING GRAPHICS FIXED FUNCTION UNITS IN A SOFTWARE-DEFINED PIPELINE AND A METHOD OF OPERATING A PIPELINE," filed on the same day as this disclosure and incorporated herein by reference.
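Purely as a non-limiting illustration of the request flow of paragraph [0012], the following C sketch models how a scalar core might package a vector instruction as a regular graphics request issued over the control bus. The structure layout, the accelerator identifiers and the io_issue routine are assumptions made for exposition and are not taken from the disclosure.

    /* Illustrative sketch only; all names are hypothetical. */
    #include <stdint.h>

    /* Accelerator identifiers for a pipeline like that of FIG. 2. */
    enum accel_id {
        ACCEL_FIXED_FN_0, ACCEL_FIXED_FN_1, ACCEL_FIXED_FN_2,
        ACCEL_MULTIBANK_MEM,
        ACCEL_VDP_A,      /* vector data path A (237) */
        ACCEL_VDP_B       /* vector data path B (239) */
    };

    /* A work request: opcode plus operand locations in the graphics
     * register that resides in the processor memory. */
    struct gfx_request {
        enum accel_id target;     /* accelerator that performs the work */
        uint32_t      opcode;     /* operation, e.g. a vector multiply  */
        uint32_t      src0, src1; /* graphics-register offsets, inputs  */
        uint32_t      dst;        /* graphics-register offset, result   */
        uint32_t      state;      /* parameters/state for the request   */
    };

    /* Hypothetical control-bus write: the IO interface picks the
     * request up and routes it to the targeted accelerator. */
    void io_issue(volatile struct gfx_request *ctrl_bus,
                  const struct gfx_request *req)
    {
        *ctrl_bus = *req;  /* a real design would add flow control */
    }

A vector instruction is thus indistinguishable, at the interface, from any other accelerator request; only the target field selects a vector data path.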
[0013] The general processing unit includes a processor memory incorporated therein having a graphics register configured to include input and output operands for the vector data paths and the other accelerators of the graphics pipeline.

[0014] In one embodiment, the processor memory is an L2 cache. An L2, or level two, cache is part of a multi-level storage architecture of the general processing unit. In some embodiments, the processor memory is built into the general processing unit itself. The processor memory can be a random access memory (RAM) or another type of volatile memory. The processor memory is used to minimize the latency penalty on the general processing unit for sharing a data path connecting the general processing unit to the input output interface, while still providing large amounts of bandwidth between the processing core of the processing unit and the vector paths.

[0015] The size of the processor L2 cache allows the graphics register included therein to hold a large number of active vector registers. In some embodiments, L2 caches are sized at about 256 KB to about 2 MB. In these embodiments, about half of each cache is used for registers, i.e., 64K to 512K registers. In one embodiment, at least a minimum number of registers is provided per fast memory access (FMA) unit in the vector data paths 237 and 239. As such, the disclosed programmable graphics pipeline enables a large number of graphics threads to tolerate latency.
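As a non-limiting illustration of the operand traffic described in paragraphs [0013] and [0014], the sketch below stages one operand block from the graphics register onto the shared data path. The 128-byte block size is an assumption chosen from within the 64 to 256 byte range given in paragraph [0017] below; the function and pointer names are likewise invented.

    /* Illustrative sketch only; names and block size are assumed. */
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_BYTES 128u   /* assumed transfer granularity */

    /* Stage one operand block from the graphics register (in L2) onto
     * the shared data path; block transfers amortize per-access
     * latency and preserve bandwidth to the vector paths. */
    void stage_operand_block(uint8_t *data_path_window,
                             const uint8_t *graphics_register,
                             uint32_t block_index)
    {
        memcpy(data_path_window,
               graphics_register + (size_t)block_index * BLOCK_BYTES,
               BLOCK_BYTES);
    }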

[0016] The general processing unit does not need to be massively multithreaded. In one embodiment, the general processing unit is designed to support one hardware thread context per cooperative thread array (CTA), 4-8 in total, in order to run per-CTA pipeline emulation code without requiring software context switches.

[0017] In one embodiment, data stored on the processor memory, e.g., input and output operands, is accessed in blocks by the input output interface to maximize bandwidth at a low cost. In one embodiment, the block size is in a range of 64 bytes to 256 bytes.

[0018] In one embodiment disclosed herein, a scratch RAM for the processing unit of the programmable graphics pipeline is also implemented as an accelerator. As such, a multi-bank memory is provided as an additional accelerator to support a scratch RAM for the processing unit, which requires highly divergent accesses. In some embodiments, a texture stage of a graphics pipeline can minimize cost by using the scratch RAM accelerator for its data banks, which also require highly divergent accesses.

[0019] The graphics pipeline architecture provided herein beneficially allows a lower entry barrier into building a GPU. Because a general processing unit is employed, the details involved in building a custom processor are avoided. Additionally, the created accelerators are mostly stateless and easier to design and verify than a full custom processor.

[0020] Considering the programmable side, most of the pipeline emulation code would run on a standard scalar core, which lowers the amount of infrastructure needed, such as compilers, debuggers and profilers, and makes it easier to achieve acceptable performance than on a custom scalar core.

[0021] Additionally, the disclosed architecture removes, or at least reduces, barriers between a general processing unit and a GPU. Switching between scalar and data parallel sections can be done in a few cycles, since the general processing unit has direct access to all vector registers that are stored in the processor memory. As an additional benefit, the general processing unit would have access to all accelerators coupled to the input output interface and could use them directly in otherwise scalar code.
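The per-CTA hardware contexts of paragraph [0016] might be pictured as follows. This C sketch is illustrative only; the context layout, the count of eight and the context_for helper are assumptions rather than disclosed structure.

    /* Illustrative sketch only; layout and count are assumed. */
    #include <stdint.h>

    #define NUM_CTA_CONTEXTS 8        /* upper end of the 4-8 range */

    struct hw_context {               /* invented per-CTA state */
        uint64_t pc;                  /* program counter of emulation code */
        uint64_t sp;                  /* stack pointer                     */
        uint32_t cta_id;              /* owning cooperative thread array   */
    };

    static struct hw_context ctx[NUM_CTA_CONTEXTS];

    /* Selecting a context is a hardware register switch, not a
     * software save/restore; modeled here as an index lookup. */
    static inline struct hw_context *context_for(uint32_t cta_id)
    {
        return &ctx[cta_id % NUM_CTA_CONTEXTS];
    }

Because switching contexts is an index selection into resident hardware state, the per-CTA emulation code avoids the cost of software context switches.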
[0022] Before describing various embodiments of the novel programmable function unit and methods associated therewith, a computing system within which the general processing unit may be employed in a GPU will be described.

[0023] FIG. 1 is a block diagram of one embodiment of a computing system 100 in which one or more aspects of the invention may be implemented. The computing system 100 includes a system data bus 132, a central processing unit (CPU) 102, input devices 108, a system memory 104, a graphics processing subsystem 106, and display devices 110. In alternate embodiments, the CPU 102, portions of the graphics processing subsystem 106, the system data bus 132, or any combination thereof, may be integrated into a single processing unit. Further, the functionality of the graphics processing subsystem 106 may be included in a chipset or in some other type of special purpose processing unit or co-processor.

[0024] As shown, the system data bus 132 connects the CPU 102, the input devices 108, the system memory 104, and the graphics processing subsystem 106. In alternate embodiments, the system memory 104 may connect directly to the CPU 102. The CPU 102 receives user input from the input devices 108, executes programming instructions stored in the system memory 104, operates on data stored in the system memory 104 and sends instructions and/or data (i.e., work or tasks to complete) to a graphics processing unit 116 to complete. The system memory 104 typically includes dynamic random access memory (DRAM) used to store programming instructions and data for processing by the CPU 102 and the graphics processing subsystem 106. The graphics processing subsystem 106 receives the transmitted work from the CPU 102 and processes the work employing a graphics processing unit (GPU) 116 thereof. In this embodiment, the GPU 116 completes the work in order to render and display graphics images on the display devices 110. In other embodiments, the GPU 116 or the graphics processing subsystem 106 as a whole can be used for non-graphics processing.

[0025] As also shown, the system memory 104 includes an application program 112, an application programming interface (API) 114, and a graphics processing unit (GPU) driver 115. The application program 112 generates calls to the API 114 in order to produce a desired set of results, typically in the form of a sequence of graphics images.
[0026] The graphics processing subsystem 106 includes the GPU 116, an on-chip GPU memory 122, an on-chip GPU data bus 136, a GPU local memory 120, and a GPU data bus 134. The GPU 116 is configured to communicate with the on-chip GPU memory 122 via the on-chip GPU data bus 136 and with the GPU local memory 120 via the GPU data bus 134. The GPU 116 may receive instructions transmitted by the CPU 102, process the instructions in order to render graphics data and images, and store these images in the GPU local memory 120. Subsequently, the GPU 116 may display certain graphics images stored in the GPU local memory 120 on the display devices 110.

[0027] The GPU 116 includes accelerators 117, an IO interface 118 and a general processing unit 119. The accelerators 117 are hardware implemented processing blocks or units that accelerate the processing of the general processing unit 119. The accelerators 117 include a conventional fixed function unit or units having circuitry configured to perform a dedicated function. The accelerators 117 also include a vector data path or paths and a multi-bank memory that are associated with the general processing unit 119.

[0028] The IO interface 118 is configured to couple the general processing unit 119 to each of the accelerators 117 and provide the necessary conversions between software and hardware formats to allow communication of requests and responses between the general processing unit 119 and the accelerators 117. The IO interface 118 provides a programmable pipeline with dynamically defined connections between stages instead of static connections. The IO interface 118 can communicate various states between the general processing unit 119 and the accelerators 117. Inputs and outputs are also translated between a software-friendly format of the general processing unit 119 and a hardware-friendly format of the accelerators. Furthermore, the IO interface 118 translates states between a software-friendly format and a hardware-friendly format. In one embodiment, the IO interface advantageously provides all of the above noted functions in a single processing block that is shared across all of the accelerators 117 and the general processing unit 119. As such, the IO interface advantageously provides a single interface that includes the necessary logic to dynamically connect the general processing unit 119 to the accelerators 117 to form a graphics pipeline and manage instruction level communications therebetween. The IO connector includes multiple components that provide bi-directional connections, including bi-directional control connections to the scalar processor core of the general processing unit 119 and bi-directional data connections to the processor memory of the general processing unit 119.
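As a non-limiting illustration of the format translation attributed to the IO interface 118 in paragraph [0028], the sketch below packs a software-friendly request into a single hardware command word and unpacks a response. The 64-bit encoding and all field widths are invented for exposition.

    /* Illustrative sketch only; the wire format is assumed. */
    #include <stdint.h>

    typedef uint64_t hw_cmd_t;   /* hardware-friendly: one command word */

    struct sw_request {          /* software-friendly: plain C fields   */
        uint8_t  accel;          /* accelerator to route to             */
        uint8_t  op;             /* operation code                      */
        uint16_t state;          /* per-request state/parameters        */
        uint32_t operand_off;    /* operand block offset in the L2      */
    };

    /* Pack: route and translate in one step, one request at a time,
     * matching the per-request dynamic connections described above. */
    static hw_cmd_t io_pack(const struct sw_request *r)
    {
        return  (hw_cmd_t)r->accel << 56
              | (hw_cmd_t)r->op    << 48
              | (hw_cmd_t)r->state << 32
              | (hw_cmd_t)r->operand_off;
    }

    /* Unpack a response: status in the high byte, payload offset low. */
    static void io_unpack(hw_cmd_t resp, uint8_t *status,
                          uint32_t *payload_off)
    {
        *status      = (uint8_t)(resp >> 56);
        *payload_off = (uint32_t)(resp & 0xFFFFFFFFu);
    }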

[0029] The general processing unit 119 is a standard processor having a scalar core that is suited for processing scalar code. The general processing unit 119 also includes a processor memory that is directly connected to the scalar processor core. By being directly connected, the processor memory is implemented as part of the general processing unit 119 such that the scalar processor core can communicate therewith without employing the IO interface 118. Thus, turning briefly to FIG. 2, both the processor memory 214 and the CPU 210 are part of the general processing unit 119, and the CPU 210 does not go through the IO interface 118 to access the processor memory 214. In one embodiment, the CPU 102 may be employed as the general processing unit for the GPU 116 and communicate with the accelerators via the system data bus 132 and the IO interface 118. As such, the CPU 102 can be used with the general processing unit 119 or used instead of the general processing unit 119.

[0030] Instead of coexisting as illustrated, in some embodiments the general processing unit 119 replaces the CPU 102 entirely. In this configuration, there is no distinction between the GPU local memory 120 and the system memory 104: they are replaced with one shared memory. For example, the CPU 102 and system memory 104 can be removed, and the application program 112, API 114 and GPU driver 115 reside in the GPU local memory 120 and are executed by the general processing unit 119. More detail of an embodiment of a graphics pipeline according to the principles of the disclosure is discussed below with respect to FIG. 2.

[0031] The GPU 116 may be provided with any amount of on-chip GPU memory 122 and GPU local memory 120, including none, and may use on-chip GPU memory 122, GPU local memory 120, and system memory 104 in any combination for memory operations.

[0032] The on-chip GPU memory 122 is configured to include GPU programming code 128 and on-chip buffers 130. The GPU programming code 128 may be transmitted from the GPU driver 115 to the on-chip GPU memory 122 via the system data bus 132.

[0033] The GPU local memory 120 typically includes less expensive off-chip dynamic random access memory (DRAM) and is also used to store data and programming used by the GPU 116. As shown, the GPU local memory 120 includes a frame buffer 126. The frame buffer 126 stores data for at least one two-dimensional surface that may be used to drive the display devices 110. Furthermore, the frame buffer 126 may include more than one two-dimensional surface so that the GPU 116 can render to one two-dimensional surface while a second two-dimensional surface is used to drive the display devices 110.

[0034] The display devices 110 are one or more output devices capable of emitting a visual image corresponding to an input data signal. For example, a display device may be built using a cathode ray tube (CRT) monitor, a liquid crystal display, or any other suitable display system. The input data signals to the display devices 110 are typically generated by scanning out the contents of one or more frames of image data that is stored in the frame buffer 126.
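The two-surface arrangement of paragraphs [0033] and [0034] amounts to classic double buffering, sketched below as a non-limiting illustration. The surface type and the render_into and scan_out helpers are hypothetical stand-ins for the GPU's actual render and scan-out machinery.

    /* Illustrative sketch only; types and helpers are invented. */
    struct surface;                              /* one 2-D image   */

    extern void render_into(struct surface *s);  /* hypothetical    */
    extern void scan_out(struct surface *s);     /* hypothetical    */

    /* Render into one surface of the frame buffer while the other
     * drives the display, alternating every frame. */
    void present_loop(struct surface *surf[2])
    {
        for (int frame = 0; ; ++frame) {
            struct surface *draw = surf[frame & 1];
            struct surface *show = surf[(frame + 1) & 1];
            render_into(draw);
            scan_out(show);
        }
    }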
[0035] Having described a computing system within which the disclosed programmable graphics pipeline and methods may be embodied or carried out, a particular embodiment of a programmable graphics pipeline will be described in the environment of a GPU 200.

[0036] FIG. 2 illustrates a block diagram of an embodiment of a programmable graphics pipeline 200 constructed according to the principles of the disclosure. The programmable graphics pipeline 200 includes a general processing unit 210, an IO interface 220 and accelerators 230.

[0037] The general processing unit 210 is a standard processing unit having a processing core 212 and a processor memory 214. The processing core 212 is a scalar core that is configured for processing scalar code. The processor memory 214 is incorporated as part of the general processing unit 210. The processor memory 214 is configured to store data to be processed by the processing core 212 or data to be sent to the accelerators 230 for processing. In one embodiment, the processor memory 214 is an L2 cache. In different embodiments, the processor memory 214 is random access memory (RAM) or a scratchpad.

[0038] The processor memory 214 includes a graphics register file 215. The graphics register file 215 is configured to include accelerator data, including input and output operands for the accelerators 230. The graphics register file 215 is also configured to include a file of active vectors of the accelerators 230. An active vector is a vector, or vector data, presently being processed in the graphics pipeline 200.

[0039] The general processing unit 210 is connected to the IO interface 220 via two bi-directional connections. As illustrated in FIG. 2, the processing core 212 is connected to the IO interface 220 via a bi-directional control connection 213, and the processor memory 214 is connected to the IO interface 220 via a bi-directional data connection 217. In one embodiment, the bi-directional control connection 213 and the bi-directional data connection 217 are conventional connectors that are sized, positioned and terminated to at least provide the communications associated therewith that are described herein.

[0040] The IO interface 220 is a single interface that connects the general processing unit 210 to the accelerators 230 to form a graphics pipeline. The IO interface 220 includes multiple components that are configured to provide non-permanent connections between the general processing unit 210 and the accelerators 230 and to provide the proper conversions to allow communication and processing of requests and responses therebetween. In one embodiment, the IO interface 220 provides dynamic connections between the general processing unit 210 and the accelerators 230 for each request generated by the general processing unit 210. A request is an instruction or command to perform an action or work on data. A request is generated by the processing core 212 for one of the accelerators 230 to perform. A response is generated or provided by one of the particular accelerators 230 as a result of the request. Associated with each request are parameters and state information that the accelerators 230 use to process the request. In some embodiments, the IO interface 220 is the IO interface 118 of FIG. 1.

[0041] The accelerators 230 are hardware implemented units that are configured to accelerate processing of the graphics pipeline 200. The accelerators 230 include three fixed function units 231, 232 and 233. The fixed function units 231, 232, 233 are fixed function units that are typically found in a GPU. Each of the fixed function units 231, 232, 233 is configured with the necessary circuitry dedicated to perform a particular function of a graphics pipeline. In one embodiment, the fixed function units 231, 232 and 233 are a geometry assist stage, a surface assist stage and a texture stage, respectively. One skilled in the art will understand that other types of fixed function units can be employed as accelerators.
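As a non-limiting illustration of paragraphs [0037] and [0038], the sketch below carves a file of active vector registers out of a directly attached processor memory. The vector width, register count and accessor are assumptions; only the placement of operands in a memory the scalar core can reach without the IO interface reflects the description above.

    /* Illustrative sketch only; width and count are assumed. */
    #include <stddef.h>

    #define VEC_LANES      32        /* assumed vector width        */
    #define NUM_ACTIVE_VEC 1024      /* assumed active-vector count */

    struct vec_reg {                 /* one active vector register  */
        float lane[VEC_LANES];
    };

    struct gfx_register_file {       /* lives in processor memory 214 */
        struct vec_reg active[NUM_ACTIVE_VEC];
    };

    /* The scalar core addresses operands directly (no IO interface
     * hop), because the memory is part of the processing unit. */
    static inline struct vec_reg *operand(struct gfx_register_file *rf,
                                          size_t index)
    {
        return &rf->active[index % NUM_ACTIVE_VEC];
    }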

[0042] The accelerators 230 also include a multi-bank memory 235, a vector data path A denoted as 237 and a vector data path B denoted as 239. In contrast to a programmable function unit implemented on a special purpose processor configured to operate on highly parallel code, the general processing unit 210 does not include vector data paths or a scratch RAM. Instead, the functionality of these components has been placed external to the programmable processor, in a fixed format, as accelerators. As such, the processing core 212 issues vector instructions as regular graphics requests to the particular vector data path, vector data path A 237 or vector data path B 239. The vector instructions are communicated via the bi-directional control connection 213 to the IO interface 220. The IO interface 220 translates the software format vector instruction into a hardware format for the appropriate vector data path and communicates the vector instruction thereto via an accelerator connection that is specifically dedicated to a single one of the accelerators 230. The dedicated accelerator connections are generally denoted in FIG. 2 as elements 240 and specifically denoted as elements 241, 242, 243, 244, 245 and 246.

[0043] In FIG. 2, two of the accelerators, the fixed function unit 233 and the multi-bank memory 235, are directly coupled together via a unit connector 248 and a unit connector 249. The unit connectors 248, 249 provide access between two different ones of the accelerators 230. For the programmable graphics pipeline 200, the unit connectors 248, 249 provide communication paths to transmit and receive data with the multi-bank memory 235. The unit connectors 248, 249, therefore, can be used to reduce or minimize the cost of the graphics pipeline by allowing the multi-bank memory 235 to be shared by the fixed function unit 233. For example, as noted above, the fixed function unit 233 can be a texture stage that typically requires a memory allowing highly divergent accesses. With the unit connectors 248, 249, the texture stage can use the multi-bank memory 235 for its texture cache data banks. In one embodiment, the multi-bank memory 235 can be a scratch RAM configured for highly divergent access. In one embodiment, the unit connectors 248, 249 and the dedicated accelerator connections 240 are conventional connectors that are sized, positioned and terminated to at least provide the communications associated therewith that are described herein.
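The highly divergent accesses served by the multi-bank memory 235 (paragraphs [0042] and [0043]) can be pictured with the following illustrative banking scheme; the bank count and geometry are invented, and real designs differ.

    /* Illustrative sketch only; bank count and sizes are assumed. */
    #include <stdint.h>

    #define NUM_BANKS  16u           /* assumed bank count       */
    #define BANK_WORDS 1024u         /* assumed words per bank   */

    static uint32_t bank[NUM_BANKS][BANK_WORDS];

    /* Word address -> (bank, row): consecutive words land in
     * different banks, so divergent lane addresses mostly avoid
     * bank conflicts and can be serviced in the same cycle. */
    uint32_t scratch_read(uint32_t word_addr)
    {
        uint32_t b   = word_addr % NUM_BANKS;
        uint32_t row = (word_addr / NUM_BANKS) % BANK_WORDS;
        return bank[b][row];
    }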
[0044] FIG. 3 illustrates a flow diagram of a method 300 of manufacturing a graphics pipeline carried out according to the principles of the disclosure. The resulting graphics pipeline can be, for example, the programmable graphics pipeline 200 of FIG. 2 or the graphics pipeline of the GPU 116. The method 300 begins in a step 305.

[0045] In a step 310, a fixed function unit is coupled to an input output interface. The input output interface can be the IO interface 118 or the IO interface 220. In one embodiment, multiple fixed function units are coupled to the input output interface. Each of the fixed function units can be a conventional fixed function unit.

[0046] A general processing unit is coupled to the input output interface in a step 320. In a step 330, the general processing unit is programmed to emulate a function unit of a graphics pipeline. The general processing unit includes a processing core and a processor memory. In one embodiment, the processing core is a scalar core and the processor memory is an L2 cache. Bi-directional control and data connections can be used to couple the input output interface to the general processing unit.

[0047] The processor memory is configured as a graphics register for the graphics pipeline in a step 340. In one embodiment, only a portion of the processor memory is configured as a graphics register.

[0048] In a step 350, vector data paths are coupled to the input output interface. The vector data paths are hardware implemented accelerators associated with the general processing unit function unit. Moving the vector processing out of the processor core of a programmable processing unit allows an off-the-shelf CPU to be used for a programmable function unit. As such, a streaming multiprocessor can be replaced with a CPU. The method 300 then ends in a step 360.
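The assembly sequence of the method 300 may be summarized, purely as an illustrative sketch, by the hypothetical bring-up routine below. Every type and function is invented; the comments map each call to a step of FIG. 3.

    /* Illustrative sketch only; all identifiers are hypothetical. */
    struct io_iface; struct fixed_fn; struct gpu_core; struct vec_path;

    extern void attach(struct io_iface *io, void *unit);    /* hypothetical */
    extern void load_emulation_code(struct gpu_core *c);    /* hypothetical */
    extern void map_graphics_register(struct gpu_core *c);  /* hypothetical */

    void build_pipeline(struct io_iface *io, struct fixed_fn *ff,
                        struct gpu_core *cpu, struct vec_path *vdp_a,
                        struct vec_path *vdp_b)
    {
        attach(io, ff);              /* step 310: couple fixed function unit */
        attach(io, cpu);             /* step 320: couple general processing unit */
        load_emulation_code(cpu);    /* step 330: program the core to emulate
                                        a graphics-pipeline function unit */
        map_graphics_register(cpu);  /* step 340: configure processor memory
                                        as the graphics register */
        attach(io, vdp_a);           /* step 350: couple vector data paths */
        attach(io, vdp_b);
    }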
[0049] The disclosed architecture provides many benefits, including a lower entry barrier into building a GPU. Since the architecture can employ an existing CPU, many of the complications and details involved in building a custom processor can be avoided. The accelerators are mostly stateless and easier to design and verify than a full processor. On the software side, most of the pipeline emulation code can run on a standard scalar core, which lowers the amount of infrastructure (compilers, debuggers, profilers) needed and makes it easier to achieve acceptable performance than on a custom scalar core.

[0050] On the more technical side, the proposed architecture can remove, or at least reduce, barriers between the CPU and the GPU. Switching between scalar and data parallel sections would be done in a few cycles, since the CPU has direct access to all vector registers. As a side benefit, the CPU would have access to all accelerators and could use them directly in otherwise scalar code.

[0051] Due to the loose coupling of components in the disclosed architecture, the design is highly modular. Some modifications are made to the CPU part (to add the control bus) and to the memory system (to add the data bus to the L2), but the same graphics components could be connected to different CPU implementations to hit different performance ratios. The disclosed architecture provides an unmodified, or mostly unmodified, core with loosely coupled vector and fixed function accelerators. In some embodiments, the modifications include the addition of a control bus to send commands to the accelerators and receive responses. This functionality can be implemented using an existing memory bus; nevertheless, a dedicated control bus can increase performance. The modifications can also include pinning the registers of the cache in place to keep them from being evicted.

[0052] While the method disclosed herein has been described and shown with reference to particular steps performed in a particular order, it will be understood that these steps may be combined, subdivided, or reordered to form an equivalent method without departing from the teachings of the present disclosure. Accordingly, unless specifically indicated herein, the order or the grouping of the steps is not a limitation of the present disclosure.

[0053] A portion of the above-described apparatuses, systems or methods may be embodied in or performed by various, such as conventional, digital data processors or computers, wherein the computers are programmed or store executable programs of sequences of software instructions to perform one or more of the steps of the methods. The software instructions of such programs may represent algorithms and be encoded in machine-executable form on non-transitory digital data storage media, e.g., magnetic or optical disks, random-access memory (RAM), magnetic hard disks, flash memories, and/or read-only memory (ROM), to enable various types of digital data processors or computers to perform one, multiple or all of the steps of one or more of the above-described methods, or functions of the apparatuses described herein. As discussed with respect to disclosed embodiments, a general processing unit having a scalar core that is suited for processing scalar code is employed.
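Returning to the cache modifications contemplated in paragraph [0051], pinning the graphics registers might look as follows. No standard C or operating system call performs this; pin_cache_range is a hypothetical hardware hook standing in for whatever mechanism a particular core provides.

    /* Illustrative sketch only; the pinning hook is hypothetical. */
    #include <stddef.h>

    extern int pin_cache_range(void *addr, size_t len);    /* hypothetical */
    extern int unpin_cache_range(void *addr, size_t len);  /* hypothetical */

    /* Keep the graphics register file resident in the L2 for the
     * lifetime of the pipeline, so operand accesses never miss. */
    int map_register_file(void *l2_region, size_t bytes)
    {
        return pin_cache_range(l2_region, bytes);
    }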

[0054] Portions of disclosed embodiments may relate to computer storage products with a non-transitory computer-readable medium that have program code thereon for performing various computer-implemented operations that embody a part of an apparatus or system, or carry out the steps of a method set forth herein. Non-transitory, as used herein, refers to all computer-readable media except for transitory, propagating signals. Examples of non-transitory computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and execute program code, such as ROM and RAM devices. Examples of program code include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.

[0055] Those skilled in the art to which this application relates will appreciate that other and further additions, deletions, substitutions and modifications may be made to the described embodiments.

What is claimed is:

1. A graphics pipeline, comprising:
accelerators;
an input output interface coupled to each of said accelerators; and
a general processing unit coupled to said input output interface and configured as a programmable function unit of said graphics pipeline, said general processing unit configured to issue vector instructions via said input output interface to vector data paths for said programmable function unit.

2. The graphics pipeline as recited in claim 1 wherein said general processing unit includes a memory having a graphics register configured to store input and output operands for said vector data paths.

3. The graphics pipeline as recited in claim 2 wherein said memory is a level two cache.

4. The graphics pipeline as recited in claim 2 wherein said general processing unit has direct access to said graphics register.

5. The graphics pipeline as recited in claim 2 wherein said graphics register is configured to store input and output operands for all of said accelerators.

6. The graphics pipeline as recited in claim 1 wherein said general processing unit issues said vector instructions as a request.

7. The graphics pipeline as recited in claim 1 wherein said accelerators include said vector data paths.

8. The graphics pipeline as recited in claim 1 wherein said accelerators further include a volatile memory.

9. The graphics pipeline as recited in claim 1 wherein said general processing unit has a standard scalar core.

10. The graphics pipeline as recited in claim 1 wherein said graphics pipeline is modular.

11. The graphics pipeline as recited in claim 1 wherein two of said accelerators are directly coupled together.
12. An apparatus, comprising:
a scalar processing core programmed to emulate a function unit of a graphics pipeline; and
a memory directly coupled to said scalar processing core and including a graphics register configured to store input and output operands for accelerators of said graphics pipeline.

13. The apparatus as recited in claim 12 wherein said scalar processing core does not include vector data paths associated with said function unit.

14. The apparatus as recited in claim 12 wherein said memory is an L2 cache.

15. The apparatus as recited in claim 12 wherein said scalar processing core supports one hardware thread context per cooperative thread array.

16. A method of manufacturing a graphics processing unit employing a general processing unit, comprising:
coupling a fixed function unit to an input output interface;
coupling a general processing unit to said input output interface;
programming said general processing unit to emulate a function unit of a graphics pipeline; and
coupling vector data paths to said input output interface, wherein said vector data paths are associated with said general processing unit function unit.

17. The method as recited in claim 16 wherein said general processing unit includes a memory, said method further comprising configuring said memory as a graphics register for said graphics pipeline.

18. The method as recited in claim 16 further comprising coupling a multi-bank memory to said input output interface, wherein said multi-bank memory is an accelerator of said graphics pipeline.

19. The method as recited in claim 16 wherein said general processing unit has a scalar core and said programming is programming said scalar core.

20. The method as recited in claim 18 further comprising connecting said fixed function unit to said multi-bank memory.