US007991983B2
(12) United States Patent — Wolrich et al.
(10) Patent No.: US 7,991,983 B2
(45) Date of Patent: Aug. 2, 2011
(54) REGISTER SET USED IN MULTITHREADED PARALLEL PROCESSOR ARCHITECTURE

(75) Inventors: Gilbert Wolrich, Framingham, MA (US); Matthew J. Adiletta, Worcester, MA (US); William R. Wheeler, Southborough, MA (US); Debra Bernstein, Sudbury, MA (US); Donald F. Hooper, Shrewsbury, MA (US)

(73) Assignee: Intel Corporation, Santa Clara, CA (US)

( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 25 days.

(21) Appl. No.: 12/477,287
(65) Prior Publication Data: US 2009/0307469 A1, Dec. 10, 2009

Related U.S. Application Data

(63) Continuation of application No. 10/070,091, filed as application No. PCT/US00/23993 on Aug. 31, 2000, now Pat. No. 7,546,444.

(60) Provisional application No. 60/151,961, filed on Sep. 1, 1999.

(51) Int. Cl.: G06F 9/46 (2006.01); G06F 15/16 (2006.01)
(52) U.S. Cl. ......
(58) Field of Classification Search: None. See application file for complete search history.

Primary Examiner — Eric Coleman
(74) Attorney, Agent, or Firm — Blakely, Sokoloff, Taylor & Zafman LLP

(57) ABSTRACT

A parallel hardware-based multithreaded processor. The processor includes a general purpose processor that coordinates system functions and a plurality of microengines that support multiple hardware threads or contexts. The processor maintains execution threads. The execution threads access a register set organized into a plurality of relatively addressable windows of registers that are relatively addressable per thread.

22 Claims, 6 Drawing Sheets
[FIG. 1 drawing sheet: block diagram of communication system 10 — PCI bus 14 and PCI bus interface with CSRs 24, processor core 20, SDRAM unit 26a with SDRAM 16a, SRAM unit 26b with SRAM 16b, AMBA translator, FBUS interface with receive/transmit FIFOs, scratch and CSRs, microengines 22, FLASHROM, and FBUS[63:0] 18 coupled to a 10/100BaseT Octal MAC 13a and a Gigabit Ethernet device 13b.]
[FIG. 4 drawing sheet: the five-stage micro-pipeline — 82a lookup microinstructions; 82b form register file addresses; 82c fetch operands; 82d ALU operation; 82e write-back of results.]
[FIG. 5 drawing sheet: general purpose register address arrangement — per-thread register windows for THREAD_0 through THREAD_3.]
REGISTER SET USED IN MULTITHREADED PARALLEL PROCESSOR ARCHITECTURE

RELATED APPLICATIONS

This application is a continuation application and claims the benefit of U.S. application Ser. No. 10/070,091, filed on Jun. 28, 2002 (now U.S. Pat. No. 7,546,444), which is the national phase application claiming benefit to International Application Serial No. PCT/US00/23993, filed on Aug. 31, 2000, which claims benefit to U.S. Provisional Application Ser. No. 60/151,961, filed on Sep. 1, 1999. These applications are hereby incorporated in their entirety.

BACKGROUND

This invention relates to computer processors.

Parallel processing is an efficient form of information processing of concurrent events in a computing process. Parallel processing demands concurrent execution of many programs in a computer, in contrast to sequential processing. In the context of a parallel processor, parallelism involves doing more than one thing at the same time. Unlike a serial paradigm where all tasks are performed sequentially at a single station, or a pipelined machine where tasks are performed at specialized stations, with parallel processing a plurality of stations are provided, each capable of performing all tasks. That is, in general all or a plurality of the stations work simultaneously and independently on the same or common elements of a problem. Certain problems are suitable for solution by applying parallel processing.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a communication system employing a hardware-based multithreaded processor.

FIG. 2 is a detailed block diagram of the hardware-based multithreaded processor of FIG. 1.

FIG. 3 is a block diagram of a microengine functional unit employed in the hardware-based multithreaded processor of FIGS. 1 and 2.

FIG. 4 is a block diagram of a pipeline in the microengine of FIG. 3.

FIG. 5 is a block diagram showing general purpose register address arrangement.

DESCRIPTION

Referring to FIG. 1, a communication system 10 includes a parallel, hardware-based multithreaded processor 12. The hardware-based multithreaded processor 12 is coupled to a bus such as a PCI bus 14, a memory system 16 and a second bus 18. The system 10 is especially useful for tasks that can be broken into parallel subtasks or functions. Specifically, the hardware-based multithreaded processor 12 is useful for tasks that are bandwidth oriented rather than latency oriented. The hardware-based multithreaded processor 12 has multiple microengines 22, each with multiple hardware controlled threads that can be simultaneously active and independently work on a task.

The hardware-based multithreaded processor 12 also includes a central controller 20 that assists in loading microcode control for other resources of the hardware-based multithreaded processor 12 and performs other general purpose computer type functions such as handling protocols, exceptions, and extra support for packet processing where the microengines pass the packets off for more detailed processing such as in boundary conditions. In one embodiment, the processor 20 is a Strong Arm® (Arm is a trademark of ARM Limited, United Kingdom) based architecture. The general purpose microprocessor 20 has an operating system. Through the operating system the processor 20 can call functions to operate on microengines 22a-22f. The processor 20 can use any supported operating system, preferably a real time operating system. For the core processor implemented as a Strong Arm architecture, operating systems such as Microsoft NT® real-time, VxWorks and µCOS, a freeware operating system available over the Internet, can be used.

The hardware-based multithreaded processor 12 also includes a plurality of functional microengines 22a-22f. Functional microengines (microengines) 22a-22f each maintain a plurality of program counters in hardware and states associated with the program counters. Effectively, a corresponding plurality of sets of threads can be simultaneously active on each of the microengines 22a-22f while only one is actually operating at any one time.

In one embodiment, there are six microengines 22a-22f as shown. Each microengine 22a-22f has capabilities for processing four hardware threads. The six microengines 22a-22f operate with shared resources including memory system 16 and bus interfaces 24 and 28. The memory system 16 includes a Synchronous Dynamic Random Access Memory (SDRAM) controller 26a and a Static Random Access Memory (SRAM) controller 26b. SDRAM memory 16a and SDRAM controller 26a are typically used for processing large volumes of data, e.g., processing of network payloads from network packets. The SRAM controller 26b and SRAM memory 16b are used in a networking implementation for low latency, fast access tasks, e.g., accessing look-up tables, memory for the core processor 20, and so forth.

The six microengines 22a-22f access either the SDRAM 16a or SRAM 16b based on characteristics of the data. Thus, low latency, low bandwidth data is stored in and fetched from SRAM, whereas higher bandwidth data for which latency is not as important is stored in and fetched from SDRAM. The microengines 22a-22f can execute memory reference instructions to either the SDRAM controller 26a or SRAM controller 26b.

Advantages of hardware multithreading can be explained by SRAM or SDRAM memory accesses. As an example, an SRAM access requested by a Thread_0 from a microengine will cause the SRAM controller 26b to initiate an access to the SRAM memory 16b. The SRAM controller controls arbitration for the SRAM bus, accesses the SRAM 16b, fetches the data from the SRAM 16b, and returns data to the requesting microengine. During an SRAM access, if the microengine, e.g., 22a, had only a single thread that could operate, that microengine would be dormant until data was returned from the SRAM. By employing hardware context swapping within each of the microengines 22a-22f, the hardware context swapping enables other contexts with unique program counters to execute in that same microengine. Thus, another thread, e.g., Thread_1, can function while the first thread, e.g., Thread_0, is awaiting the read data to return. During execution, Thread_1 may access the SDRAM memory 16a. While Thread_1 operates on the SDRAM unit, and Thread_0 is operating on the SRAM unit, a new thread, e.g., Thread_2, can now operate in the microengine 22a. Thread_2 can operate for a certain amount of time until it needs to access memory or perform some other long latency operation, such as making an access to a bus interface. Therefore, simultaneously, the processor 12 can have a bus operation, SRAM operation and SDRAM operation all being completed or operated upon by one microengine 22a and have one more thread available to process more work in the data path.

The hardware context swapping also synchronizes completion of tasks. For example, two threads could hit the same shared resource, e.g., SRAM. Each one of the separate functional units, e.g., the FBUS interface 28, the SRAM controller 26b, and the SDRAM controller 26a, when they complete a requested task from one of the microengine thread contexts, reports back a flag signaling completion of an operation. When the flag is received by the microengine, the microengine can determine which thread to turn on.

One example of an application for the hardware-based multithreaded processor 12 is as a network processor. As a network processor, the hardware-based multithreaded processor 12 interfaces to network devices such as a media access controller device, e.g., a 10/100BaseT Octal MAC 13a or a Gigabit Ethernet device 13b. In general, as a network processor, the hardware-based multithreaded processor 12 can interface to any type of communication device or interface that receives/sends large amounts of data. Communication system 10 functioning in a networking application could receive a plurality of network packets from the devices 13a, 13b and process those packets in a parallel manner. With the hardware-based multithreaded processor 12, each network packet can be independently processed.

Another example for use of processor 12 is a print engine for a postscript processor or as a processor for a storage subsystem, i.e., RAID disk storage. A further use is as a matching engine. In the securities industry, for example, the advent of electronic trading requires the use of electronic matching engines to match orders between buyers and sellers. These and other parallel types of tasks can be accomplished on the system 10.

The processor 12 includes a bus interface 28 that couples the processor to the second bus 18. Bus interface 28 in one embodiment couples the processor 12 to the so-called FBUS 18 (FIFO bus). The FBUS interface 28 is responsible for controlling and interfacing the processor 12 to the FBUS 18. The FBUS 18 is a 64-bit wide FIFO bus, used to interface to Media Access Controller (MAC) devices.

The processor 12 includes a second interface, e.g., a PCI bus interface 24, that couples other system components that reside on the PCI 14 bus to the processor 12. The PCI bus interface 24 provides a high speed data path 24a to memory 16, e.g., the SDRAM memory 16a. Through that path data can be moved quickly from the SDRAM 16a through the PCI bus 14, via direct memory access (DMA) transfers. The hardware based multithreaded processor 12 supports image transfers. The hardware based multithreaded processor 12 can employ a plurality of DMA channels so if one target of a DMA transfer is busy, another one of the DMA channels can take over the PCI bus to deliver information to another target to maintain high processor 12 efficiency. Additionally, the PCI bus interface 24 supports target and master operations. Target operations are operations where slave devices on bus 14 access SDRAMs through reads and writes that are serviced as a slave to target operation. In master operations, the processor core 20 sends data directly to or receives data directly from the PCI interface 24.

Each of the functional units is coupled to one or more internal buses. As described below, the internal buses are dual, 32 bit buses (i.e., one bus for read and one for write). The hardware-based multithreaded processor 12 also is constructed such that the sum of the bandwidths of the internal buses in the processor 12 exceeds the bandwidth of external buses coupled to the processor 12. The processor 12 includes an internal core processor bus 32, e.g., an ASB bus (Advanced System Bus), that couples the processor core 20 to the memory controllers 26a, 26b and to an ASB translator 30 described below. The ASB bus is a subset of the so-called AMBA bus that is used with the Strong Arm processor core. The processor 12 also includes a private bus 34 that couples the microengine units to SRAM controller 26b, ASB translator 30 and FBUS interface 28. A memory bus 38 couples the memory controllers 26a, 26b to the bus interfaces 24 and 28 and memory system 16, including flashrom 16c used for boot operations and so forth.

Referring to FIG. 2, each of the microengines 22a-22f includes an arbiter that examines flags to determine the available threads to be operated upon. Any thread from any of the microengines 22a-22f can access the SDRAM controller 26a, SRAM controller 26b or FBUS interface 28. The memory controllers 26a and 26b each include a plurality of queues to store outstanding memory reference requests. The queues either maintain order of memory references or arrange memory references to optimize memory bandwidth. For example, if a thread_0 has no dependencies or relationship to a thread_1, there is no reason that threads 1 and 0 cannot complete their memory references to the SRAM unit out of order. The microengines 22a-22f issue memory reference requests to the memory controllers 26a and 26b. The microengines 22a-22f flood the memory subsystems 26a and 26b with enough memory reference operations such that the memory subsystems 26a and 26b become the bottleneck for processor 12 operation.

If the memory subsystem 16 is flooded with memory requests that are independent in nature, the processor 12 can perform memory reference sorting. Memory reference sorting improves achievable memory bandwidth. Memory reference sorting, as described below, reduces dead time or a bubble that occurs with accesses to SRAM. With memory references to SRAM, switching current direction on signal lines between reads and writes produces a bubble or a dead time waiting for current to settle on conductors coupling the SRAM 16b to the SRAM controller 26b.

That is, the drivers that drive current on the bus need to settle out prior to changing states. Thus, repetitive cycles of a read followed by a write can degrade peak bandwidth. Memory reference sorting allows the processor 12 to organize references to memory such that long strings of reads can be followed by long strings of writes. This can be used to minimize dead time in the pipeline to effectively achieve closer to maximum available bandwidth. Reference sorting helps maintain parallel hardware context threads. On the SDRAM, reference sorting allows hiding of pre-charges from one bank to another bank. Specifically, if the memory system 16a is organized into an odd bank and an even bank, while the processor is operating on the odd bank, the memory controller can start precharging the even bank. Precharging is possible if memory references alternate between odd and even banks. By ordering memory references to alternate accesses to opposite banks, the processor 12 improves SDRAM bandwidth. Additionally, other optimizations can be used. For example, merging optimizations where operations that can be merged are merged prior to memory access, open page optimizations where by examining addresses an opened page of memory is not reopened, chaining, as will be described below, and refreshing mechanisms, can be employed.

The FBUS interface 28 supports Transmit and Receive flags for each port that a MAC device supports, along with an Interrupt flag indicating when service is warranted. The FBUS interface 28 also includes a controller 28a that performs header processing of incoming packets from the FBUS 18. The controller 28a extracts the packet headers and performs a microprogrammable source/destination/protocol hashed lookup (used for address smoothing) in SRAM. If the hash does not successfully resolve, the packet header is sent to the processor core 20 for additional processing. The FBUS interface 28 supports the following internal data transactions:

FBUS unit (Shared bus SRAM) to/from microengine.
FBUS unit (via private bus) writes from SDRAM Unit.
FBUS unit (via Mbus) Reads to SDRAM.

The FBUS 18 is a standard industry bus and includes a data bus, e.g., 64 bits wide, and sideband control for address and read/write control. The FBUS interface 28 provides the ability to input large amounts of data using a series of input and output FIFOs 29a-29b. From the FIFOs 29a-29b, the microengines 22a-22f fetch data from or command the SDRAM controller 26a to move data from a receive FIFO, in which data has come from a device on bus 18, into the FBUS interface 28. The data can be sent through memory controller 26a to SDRAM memory 16a, via a direct memory access. Similarly, the microengines can move data from the SDRAM 16a to interface 28, out to FBUS 18, via the FBUS interface 28.

Data functions are distributed amongst the microengines. Connectivity to the SRAM 26b, SDRAM 26a and FBUS 28 is via command requests. A command request can be a memory request or a FBUS request. For example, a command request can move data from a register located in a microengine 22a to a shared resource, e.g., an SDRAM location, SRAM location, flash memory or some MAC address. The commands are sent out to each of the functional units and the shared resources. However, the shared resources do not need to maintain local buffering of the data. Rather, the shared resources access distributed data located inside of the microengines. This enables microengines 22a-22f to have local access to data rather than arbitrating for access on a bus and risking contention for the bus. With this feature, there is a 0 cycle stall for waiting for data internal to the microengines 22a-22f.

The data buses, e.g., ASB bus 30, SRAM bus 34 and SDRAM bus 38, coupling these shared resources, e.g., memory controllers 26a and 26b, are of sufficient bandwidth such that there are no internal bottlenecks. Thus, in order to avoid bottlenecks, the processor 12 has a bandwidth requirement where each of the functional units is provided with at least twice the maximum bandwidth of the internal buses. As an example, the SDRAM can run a 64 bit wide bus at 83 MHz. The SRAM data bus could have separate read and write buses, e.g., could be a read bus of 32 bits wide running at 166 MHz and a write bus of 32 bits wide at 166 MHz. That is, in essence, 64 bits running at 166 MHz, which is effectively twice the bandwidth of the SDRAM.

The core processor 20 also can access the shared resources. The core processor 20 has a direct communication to the SDRAM controller 26a, to the bus interface 24 and to SRAM controller 26b via bus 32. However, to access the microengines 22a-22f and transfer registers located at any of the microengines 22a-22f, the core processor 20 accesses the microengines 22a-22f via the ASB Translator 30 over bus 34. The ASB translator 30 can physically reside in the FBUS interface 28, but logically is distinct. The ASB Translator 30 performs an address translation between FBUS microengine transfer register locations and core processor addresses (i.e., ASB bus) so that the core processor 20 can access registers belonging to the microengines 22a-22f.

Although microengines 22 can use the register set to exchange data as described below, a scratchpad memory 27 is also provided to permit microengines to write data out to the memory for other microengines to read. The scratchpad 27 is coupled to bus 34.

The processor core 20 includes a RISC core 50 implemented in a five stage pipeline performing a single cycle shift of one operand or two operands in a single cycle, and provides multiplication support and 32 bit barrel shift support. This RISC core 50 is a standard Strong Arm® architecture, but it is implemented with a five stage pipeline for performance reasons. The processor core 20 also includes a 16 kilobyte instruction cache 52, an 8 kilobyte data cache 54 and a prefetch stream buffer 56. The core processor 20 performs arithmetic operations in parallel with memory writes and instruction fetches. The core processor 20 interfaces with other functional units via the ARM defined ASB bus. The ASB bus is a 32-bit bi-directional bus 32.

Microengines:

Referring to FIG. 3, an exemplary one of the microengines 22a-22f, e.g., microengine 22f, is shown. The microengine includes a control store 70 which, in one implementation, includes a RAM of here 1,024 words of 32 bits. The RAM stores a microprogram. The microprogram is loadable by the core processor 20. The microengine 22f also includes controller logic 72. The controller logic includes an instruction decoder 73 and program counter (PC) units 72a-72d. The four micro program counters 72a-72d are maintained in hardware. The microengine 22f also includes context event switching logic 74. Context event logic 74 receives messages (e.g., SEQ_EVENT_RESPONSE; FBI_EVENT_RESPONSE; SRAM_EVENT_RESPONSE; SDRAM_EVENT_RESPONSE; and ASB_EVENT_RESPONSE) from each one of the shared resources, e.g., SRAM 26b, SDRAM 26a, or processor core 20, control and status registers, and so forth. These messages provide information on whether a requested function has completed. Based on whether or not a function requested by a thread has completed and signaled completion, the thread needs to wait for that completion signal, and if the thread is enabled to operate, then the thread is placed on an available thread list (not shown). The microengine 22f can have a maximum of, e.g., 4 threads available.

In addition to event signals that are local to an executing thread, the microengines 22 employ signaling states that are global. With signaling states, an executing thread can broadcast a signal state to all microengines 22, e.g., a Receive Request Available signal. Any and all threads in the microengines can branch on these signaling states. These signaling states can be used to determine availability of a resource or whether a resource is due for servicing.

The context event logic 74 has arbitration for the four (4) threads. In one embodiment, the arbitration is a round robin mechanism. Other techniques could be used, including priority queuing or weighted fair queuing. The microengine 22f also includes an execution box (EBOX) data path 76 that includes an arithmetic logic unit 76a and general purpose register set 76b. The arithmetic logic unit 76a performs arithmetic and logical functions as well as shift functions. The register set 76b has a relatively large number of general purpose registers. As will be described in FIG. 6, in this implementation there are 64 general purpose registers in a first bank, Bank A, and 64 in a second bank, Bank B. The general purpose registers are windowed, as will be described, so that they are relatively and absolutely addressable.

The microengine 22f also includes a write transfer register stack 78 and a read transfer stack 80. These registers are also windowed so that they are relatively and absolutely addressable. Write transfer register stack 78 is where write data to a resource is located. Similarly, read register stack 80 is for return data from a shared resource. Subsequent to or concurrent with data arrival, an event signal from the respective shared resource, e.g., the SRAM controller 26b, SDRAM controller 26a or core processor 20, will be provided to context event arbiter 74, which will then alert the thread that the data is available or has been sent. Both transfer register banks 78 and 80 are connected to the execution box (EBOX) 76 through a data path. In one implementation, the read transfer register has 64 registers and the write transfer register has 64 registers.

Referring to FIG. 4, the microengine datapath maintains a 5-stage micro-pipeline 82. This pipeline includes lookup of microinstruction words 82a, formation of the register file addresses 82b, read of operands from register file 82c, ALU, shift or compare operations 82d, and write-back of results to registers 82e. By providing a write-back data bypass into the ALU/shifter units, and by assuming the registers are implemented as a register file (rather than a RAM), the microengine can perform a simultaneous register file read and write, which completely hides the write operation.

The SDRAM interface 26a provides a signal back to the requesting microengine on reads that indicates whether a parity error occurred on the read request. The microengine microcode is responsible for checking the SDRAM read Parity flag when the microengine uses any return data. Upon checking the flag, if it was set, the act of branching on it clears it. The Parity flag is only sent when the SDRAM is enabled for

Addressing of general purpose registers 78 occurs in 2 modes depending on the microword format. The two modes are absolute and relative. In absolute mode, addressing of a register address is directly specified in a 7-bit source field (a6-a0 or b6-b0), and a register address is directly specified in an 8-bit dest field (d7-d0):

[Absolute-mode bit-field encoding tables (bits 7-0 for the A GPR, B GPR and SRAM/ASB entries) are not recoverable from this copy.]
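The context swapping and completion-flag mechanism described above can be sketched roughly as follows. This is an illustrative Python model only, not the patent's hardware logic; the class and method names are inventions for the example:

```python
# Minimal sketch of hardware context swapping (illustrative model): a thread
# that issues a long-latency memory reference is swapped out, and the
# completion flag reported by the shared resource (e.g. the SRAM controller)
# makes it runnable again.
from collections import deque

class Microengine:
    def __init__(self, n_threads=4):
        self.ready = deque(range(n_threads))   # four hardware thread contexts
        self.waiting = set()

    def run_slice(self):
        """Pick the next ready thread (round robin); None means idle."""
        return self.ready.popleft() if self.ready else None

    def swap_out(self, thread):
        """Thread awaits e.g. an SRAM read; its context leaves the ready list."""
        self.waiting.add(thread)

    def completion_flag(self, thread):
        """A shared resource reports completion; the thread becomes ready."""
        self.waiting.discard(thread)
        self.ready.append(thread)

me = Microengine()
t0 = me.run_slice()          # Thread_0 executes...
me.swap_out(t0)              # ...issues an SRAM read and swaps out
t1 = me.run_slice()          # Thread_1 runs while Thread_0 waits
me.completion_flag(t0)       # read data returned; Thread_0 is ready again
print(t0, t1)                # 0 1
```

The point of the sketch is only the scheduling shape: the engine is never dormant while any context has work, which is the advantage the text attributes to hardware multithreading.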
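The two addressing modes for the windowed registers can be modeled simply. This is an illustrative sketch; the window size and per-thread base layout below are assumptions for the example, not the patent's microword field encoding:

```python
# Illustrative model of relatively and absolutely addressable register
# windows (window size and layout are assumptions, not the actual encoding).
BANK_SIZE = 64                    # 64 general purpose registers per bank
N_THREADS = 4                     # four hardware threads per microengine
WINDOW = BANK_SIZE // N_THREADS   # registers visible to each thread

def absolute(reg):
    """Absolute mode: the microword names a bank register directly."""
    assert 0 <= reg < BANK_SIZE
    return reg

def relative(thread, reg):
    """Relative mode: the same microword addresses a different physical
    register depending on which thread (context) is executing."""
    assert 0 <= thread < N_THREADS and 0 <= reg < WINDOW
    return thread * WINDOW + reg

# The same relative register 3 maps to a different physical register
# in each thread's window:
print([relative(t, 3) for t in range(N_THREADS)])   # [3, 19, 35, 51]
```

Relative addressing is what lets one microprogram run unchanged in every context, while absolute addressing lets threads (and the core processor, via the ASB translator) share specific registers.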