HYBRIDS BUD ON EMBEDDED LANDSCAPE
Meanwhile, ARM Allies Plot World Domination
By Joseph Byrne (January 9, 2012)

A decade from now, 2011 will be remembered for one event in the world of high-speed embedded processors: a small, slow-growing supplier of PowerPC chips staked its future on high-performance multicore ARM processors. Looking back from 2021, the bet will have paid off: this supplier will have transcended niche status, and other vendors will have similarly embraced the ARM architecture. That’s the hope of AppliedMicro and ARM, at least. In November, AppliedMicro uncloaked its ambitious plan to design custom 64-bit ARM-compatible CPUs, to develop multicore processors based thereon, and to aim these processors at both server and communications markets.

The perspective at the end of 2011 is different. The big development in 2011 was the sampling of two new processor hybrids. Xilinx and Altera crossbred CPUs and FPGAs to reduce system cost in existing two-chip designs and to enable development of new designs where a single device is the only practical solution. Freescale and numerous other companies crossbred CPUs, DSPs, and accelerator engines to craft a new class of chip for small LTE base stations. Intended to ship in high volume and at low cost, an integrated device is a boon for these cellular systems. In parallel, embedded-processor vendors unveiled their 28nm roadmaps, and Broadcom struck a $3.7 billion deal to acquire NetLogic (Broadcom’s biggest deal since its 2000 acquisition of SiByte for $2 billion).

FPGAs to CPUs: You Will Be Assimilated

Apart from higher-profile developments around the ARM architecture, ARM emerged as the chosen processor for the new CPU-FPGA hybrids from Xilinx and Altera. Both FPGA suppliers disclosed details of their forthcoming devices that integrate ARM Cortex-A9 CPUs and peripherals. Neither company had previously achieved lasting success in adding hard CPUs to their FPGAs, but both promise that this time is different. Indeed, the technologies of the new chips are different. These hybrids can boot their CPUs before loading the FPGA configuration, enabling them to function more like embedded processors and less like an FPGA with a peripheral CPU. Running at 800MHz, the ARM CPUs are speedy enough for mainstream embedded systems.

Embedded designs that today would pair a processor like a Freescale QorIQ P1020 with a Xilinx Spartan or Altera Cyclone FPGA are strong candidates for these new hybrids. The hybrid approach enables the FPGA company to capture additional value (a fancy way of saying they can charge more) by cutting the processor supplier out of the picture. Merging two chips into one reduces the board area, system power, and bill-of-materials cost of existing designs and enables new systems where power and performance requirements preclude a two-chip solution.

The Xilinx and Altera lines differ slightly, as Figure 1 shows. The four-member Xilinx family, dubbed Zynq, has devices with 28,000–350,000 logic cells and sampled in 2011. All have twin CPUs and an analog-to-digital converter. The two high-end FPGAs include 12.5Gbps serdes. The two Zynqs with the fewest FPGA gates have no serdes and hence no PCI Express (PCIe) connectivity, a serious omission given the ubiquity of PCIe in embedded processing. A 2.5Gbps or 5Gbps serdes would incur incremental cost but broaden the usefulness of these chips.

Altera did not coin a new brand name for its chips, instead extending its FPGA Cyclone (low density) and Arria (midrange) brands to include its new hybrids. The company


calls these hybrids SoC FPGAs, the most mellifluous moniker since PCMCIA. The Altera lineup is broader, extending from 25,000 to 462,000 logic elements. (A Xilinx logic cell and an Altera logic element represent approximately the same capacity.) Altera, however, will not sample its CPU-FPGA combinations until 2H12, giving Xilinx a year to win the first wave of designs and broaden its offerings. On balance, Altera’s chips have better high-speed I/O options than Zynq. The lowest-density Cyclone device includes no serdes, but the other Cyclones integrate 5Gbps transceivers. These serdes can support PCIe Gen2 at lower manufacturing costs compared with 10Gbps transceivers. The denser Arria models have both 6Gbps and 10Gbps serdes.

DSP capability, indicated by the size of the bubbles in Figure 1, varies proportionally with gate count. The Zynq devices have between 80 and 900 DSP slices. Each configurable slice includes a 25x18-bit multiplier, an accumulator/ALU, pre-adders, and other functions. Altera’s devices provide 36–1,068 variable-precision DSP blocks, and each block can perform a single 27x27-bit multiply, a pair of 18x19-bit multiplies, or three 9x9-bit multiplies. Xilinx offers more DSP units, most noticeably at the lower densities, but the difference narrows if a designer can use Altera’s reduced-precision modes. In either case, these units can implement video encoders, filters, and other signal-processing functions. For the automotive market, for example, Xilinx has diagrammed Zynq-based systems that analyze video of the road and provide lane-departure warnings.

Roadmaps to 28nm in 2012

These processor-FPGA hybrids are among the first 28nm chips targeting embedded designs. Freescale and NetLogic (soon to be acquired by Broadcom) also disclosed their 28nm roadmaps in 2011. Incorporating a new CPU design, Freescale’s upcoming processors significantly advance the QorIQ line. At about the same time, NetLogic’s new XLP II family will supplant many of the first-generation 40nm XLP processors, and it extends the company’s top end to much higher performance.

Freescale has publicly divulged a few details about its 28nm QorIQ Amp processors but has held others close. The most important new ingredient is the 64-bit Power e6500 CPU. Departing from its predecessor, the e500/e5500, the new CPU is a fused-core dual-thread implementation like AMD’s Bulldozer (see MPR 8/30/10, “AMD Bulldozer Plows New Ground”). The two threads share the front end of the relatively short pipeline and the AltiVec SIMD unit, but they have independent integer execution units. Freescale claims a 70% speedup over a single-thread implementation while incurring only a 30% area penalty. Better branch prediction and higher clock rates improve performance compared with the e500. Applications that use AltiVec, which has not appeared in a Freescale CPU since the e600 (last used in the MPC8641 and MPC7448; see MPR 7/5/05, “PowerPC Ain’t Dead Yet”), will run even faster.

The first of Freescale’s new QorIQ Amp processors is the T4240. In the company’s logical naming scheme, the digits 24 in the part name indicate the number of threads. The chip divides the 12 dual-thread physical CPUs into clusters of 4. Data-plane accelerators like those in the 45nm P-series handle packet processing. Freescale is aiming for 40Gbps of IPSec throughput, about four times that of the eight-core P4080 but on par with shipping top-end processors from Cavium and NetLogic. The T4240’s performance could exceed that of these chips, however, owing to its speedy CPUs. Freescale plans to sample the T4240 in 1Q12, giving the company a lead in bringing high-performance 28nm embedded processors to customers.

In contrast with Freescale, NetLogic has been more forthcoming in its disclosures about its 28nm processors, revealing select details for a broad line of XLP II processors. The XLP II family resembles the first-generation XLP line. The four-way superscalar CPU with four-way multithreading and the accompanying offload engines and interfaces receive a few updates, but their architectures are largely unchanged. NetLogic puts 28nm technology to good use, boosting performance and employing the larger transistor budget to add CPUs, faster accelerators, and faster interfaces at the high end and to reduce cost and power throughout the product line.

Anchoring the top end is the XLP980 (a member of the XLP II line).

Figure 1. Xilinx Zynq versus Altera SoC FPGA. Bubble size indicates number of DSP units. (Source: vendors)
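The DSP comparison behind Figure 1 can be made concrete with a little arithmetic. The sketch below is only a rough illustration: it takes the slice and block counts quoted in the text, assumes every Altera block runs in the same precision mode, and treats one Zynq slice as one multiplier. The function and constant names are hypothetical, and the result is a raw multiplier count, not a performance benchmark.

```python
# DSP-unit counts quoted in the article (lowest and highest family members).
ZYNQ_SLICES = (80, 900)       # Xilinx DSP slices, one 25x18-bit multiplier each
ALTERA_BLOCKS = (36, 1068)    # Altera variable-precision DSP blocks

# Multiplies one Altera block can perform in each mode, per the article.
ALTERA_MODES = {"27x27": 1, "18x19": 2, "9x9": 3}


def altera_multipliers(blocks: int, mode: str) -> int:
    """Multipliers available if every block is configured in `mode`.

    Assumption for illustration: all blocks use the same mode.
    """
    return blocks * ALTERA_MODES[mode]


def summarize() -> dict:
    """Return (low, high) multiplier counts per family and precision mode."""
    table = {"Zynq 25x18": ZYNQ_SLICES}
    for mode in ALTERA_MODES:
        table[f"Altera {mode}"] = tuple(
            altera_multipliers(blocks, mode) for blocks in ALTERA_BLOCKS
        )
    return table


if __name__ == "__main__":
    for name, (low, high) in summarize().items():
        print(f"{name:>14}: {low:>5} to {high:>5} multipliers")
```

Under these assumptions, the smallest Altera device in 18x19-bit mode offers 72 multipliers against 80 slices on the smallest Zynq, which is the narrowing the text describes. The smallest members of the two families are not identical in logic capacity, so the comparison is indicative only.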


Setting a new high-water mark for embedded-processor capabilities, this chip supports 80 threads (20 physical CPUs), 2.5x that of NetLogic’s most advanced 40nm processor, the XLP832. Multichip scaling doubles to eight sockets, enabling 160-way SMP configurations. Packet-processing performance is also 2.5x greater, reaching 100Gbps. To satiate this voracious appetite for data, NetLogic outfitted the processor with 40Gbps interfaces.

At the opposite end of the performance spectrum, NetLogic is readying a dual-core XLP II processor, the XLP208. That name may sound familiar: the company already offers a 40nm XLP208, which has been renamed the XLP208A to obliterate any confusion. Apparently, the company is taking a mulligan on the 40nm version, presumably to reduce cost and power while bumping up performance. OEMs that have been developing products using the 40nm version will likely convert to 28nm for production.

The 28nm XLP332 is similar to the XLP832 but reduces cost through the process shrink, a smaller L3 cache, elimination of multisocket SMP scaling, and halving the number of DRAM interfaces to two. These changes make the XLP332 better suited to cellular base stations and other designs where the XLP832’s CPU performance is welcome but other capabilities are overkill. NetLogic is targeting 1Q12 for samples of the new XLP208 and the XLP332. The XLP980 is expected to sample in 2Q12.

By that time, Broadcom is expected to have completed its acquisition of NetLogic. Broadcom brings sizable financial and engineering resources and a broad array of complementary physical-layer products to this market. It also has a large sales force and established relationships with all major communications-system vendors. Along with the addition of NetLogic’s processors, these factors should help Broadcom challenge embedded leaders Intel and Freescale.

Small Cells Are a Big Attraction

NetLogic has strengths of its own in the market. One of the primary reasons behind Broadcom’s purchase is expansion of the company’s market opportunities. NetLogic has won cellular base-station designs with its XLS and XLP processors, which Broadcom hopes to exploit further. To compete in baseband processing, NetLogic’s portfolio requires a DSP, a gap that Broadcom may be able to fill by upgrading its femtocell technology.

In the meantime, NetLogic’s main rivals, Freescale and Cavium, are advancing their own infrastructure baseband chips. Combining both CPU and DSP technology, these chips are hybrids of a different sort. Directed specifically at cellular base stations, they also include signal-processing accelerators and seek to reduce system cost, an important factor in the proliferation of small-cell base stations.

The small-cell market is in a state of flux. Roughly one million femtocells have shipped. Because this number is less than originally forecast, the industry has reduced its expectations for residential systems, which typically support four users; operators mainly deploy these systems to improve coverage. Simultaneously, studies show that small cells for public access are an economical alternative to new macro sites for extending coverage in rural areas and for improving capacity in cities. Enthusiasm has thus grown for systems supporting 32 or more users and higher bandwidth than residential femtocells provide. Meanwhile, LTE deployments are ramping.

The effect on chip suppliers is increased attention to serving these higher-end small cells. At the same time, the market, although it is growing more slowly than was hoped, is sufficiently mature to attract established chip companies. Broadcom, for example, entered the fray by acquiring Percello. Qualcomm is active in 3G femtocells; its newest chips serve 16 and 32 users (up from 8 in its prior generation). These new designs integrate Qualcomm’s proprietary Hexagon DSP and speedy ARM-compatible Scorpion CPU (best known as the CPU in Qualcomm’s Snapdragon processors).

Picochip is the market leader, having many wins for residential 3G femtocells. For LTE femtocells such as those deployed in 2011 by SK Telecom, the PC500 employs Picochip’s massively parallel DSP technology (see MPR 7/28/03, “Picochip Preaches Parallelism”) but requires a separate control processor, like Cavium’s Octeon II CN6330.

Cavium, Freescale, and TI have all announced hybrids for LTE small cells. In addition to the number of supported users and LTE capabilities, an important distinction between these and most 3G products is their openness. OEMs can program the CPUs and DSPs, whereas Picochip’s 3G femtocell products, for example, are turnkey ASSPs. Mirroring the chip market, startups have supplied a large portion of femtocell systems. Programmable devices will allow major OEMs to add their own differentiation through software. In the long term, the market may morph into one supplied solely by ASSPs; the chip design would not change, but chip suppliers would provide all requisite software below the application layer. This pattern has been established in other high-volume markets where DSPs and CPUs have come together, such as DSL, cable modem, and cellular handsets.

Freescale’s small-cell chip is straightforward. Already serving the market with products such as its P2020 microprocessor and MSC8156 DSP, the company combined elements of each to form the QorIQ Qonverge PSC913x. (In fact, Freescale’s software-development platform for Qonverge was a board containing a P2010 and MSC8156.) The PSC9130 supports eight users and LTE, putting it a rung above femtocell ASSPs from Picochip and Broadcom/Percello that target home use. For enterprise and public-access small cells, the dual-CPU, dual-DSP PSC9132 supports more than 64 users and faster over-the-air rates.

Having no DSP technology in its portfolio, Cavium set up a skunk works to develop an LTE baseband DSP and collaborated with an OEM to craft accompanying software.


The baseband DSP is not a single core but is separate DSPs for symbol, soft-bit, and control processing. Cavium combined this DSP and signal-processing offload engines with CPUs and packet-processing blocks from its Octeon II family to create Octeon Fusion. The company is aiming for larger base stations than Freescale is with its initial Qonverge chips. The base Octeon Fusion CNF7120 supports 64 users, while other models support 256 or more.

This past year, long-time DSP powerhouse Texas Instruments also sampled its first CPU-DSP hybrids for small cells, employing its Keystone heterogeneous multicore framework. Unlike Freescale and Cavium, which have their own CPUs, TI licenses an ARM Cortex-A8. The company also differs in that a DSP assists with scheduling, a function that other designs assign to a CPU, which diminishes the number of users served as bandwidth increases. TI is similar, however, in targeting LTE base stations for 64 or more users, as Figure 2 shows.

Figure 2. Comparison of LTE femtocell chips. Cavium targets the high end of the market, with Freescale at the low end and TI in the middle. Over time, these product lines are likely to overlap more.

These highly integrated base-station chips will steal a growth market from embedded processors. The number of small-cell deployments doubled in 2011, and shipments, despite growing more slowly than originally expected, are still increasing much faster than those of macro base stations. For the next few years at least, macro cells will continue to use standalone processors, so the opportunity for embedded processors in base stations is not decreasing. In fact, it’s increasing, owing to the added performance demanded by LTE. For now, the processing loads of these designs are too great to be served economically by a single chip, and in many cases, an ASIC handles DSP functions. Ultimately, integration is as inevitable as Moore’s Law.

Many New Processors in 2011

In 2011, AppliedMicro belatedly entered the dual-core age. In many ways similar to Freescale’s P2020, the dual-core APM86290 is less expensive according to our estimates, and it incorporates AppliedMicro’s SlimPro. A novel feature, SlimPro is an on-chip management processor that operates in a secure domain. (See MPR 12/20/10, “First PacketPro Chips Debut.”)

Cavium’s Octeon II CN6880 integrates more CPUs and offload engines than any other communications-focused embedded processor. (The CN6880’s reign could be short. Tilera added offload engines to its manycore architecture and has begun sampling the first Tile-Gx products.) Performance on high-level applications will be modest owing to the moderate , simple CPU microarchitecture, and shallow cache. This chip was designed to tear up parallelizable networking and storage workloads that handle lots of transient data. The CN6880 thus uses compact, efficient CPUs and small, low-latency caches. (See MPR 5/31/10, “Cavium Pushes Octeon to 32 CPUs.”)

For communications applications where single-thread performance is the dominant concern, Freescale brought the dual-core QorIQ P5020 to production. The company’s first 64-bit processor, the P5020, is the only communications-oriented embedded processor to top 2.0GHz, yet it consumes less than 30W. As NetLogic upped the ante in CPU performance and as Intel reduced the power consumption and footprint of its solutions, Freescale found it necessary to offer a much faster replacement for its aged MPC8641 and MPC8572 dual-core processors. (See MPR 7/5/10, “Freescale P5 Raises QorIQ’s I.Q.”)

LSI’s Axxia processors uniquely combine network-processor and embedded-processor technology. The company could have just glued CPUs to its NPUs, but instead, it decoupled the NPU engines so that data can flow among them in an arbitrarily defined pipeline. Compared with other processors for communications, Axxia offers a much more autonomous and programmable data plane. Harnessing this clever architecture requires equally clever software. LSI supplies many software modules targeting a narrow set of applications: ATM and IP transport, carrier Ethernet and cellular backhaul, and x86 offload. (See MPR 2/28/11, “LSI Expands Its Axxia Family.”)

Whereas Cavium and LSI target the data plane with their processors and Freescale targets the control plane with the P5020, NetLogic strove to achieve the best of all worlds with the XLP832. For applications requiring maximum single-thread performance, large caches complement a CPU implementing a four-way superscalar microarchitecture with out-of-order execution. For parallel workloads, the CPUs implement four-way simultaneous multithreading. Copious offload engines, including a brigade of “MicroMIPS” CPUs for packet parsing, enable the processor to plow

through networking tasks. (See MPR 7/26/10, “NetLogic Broadens XLP Family.”) We originally expected the XLP832 to ship in 2010, but NetLogic did not qualify it for production until 2011.

While AppliedMicro plots an ARM-based multicore future, Marvell is living the dream. The first general-purpose embedded processor with four ARM-compatible cores in production, Marvell’s Armada XP MV78460 combines the company’s PJ4B CPUs with Ethernet, SATA, and PCI Express interfaces. At 1.6GHz and with 2MB of L2 cache, the processor will outrun competitors that use four Cortex-A9 CPUs. Its 10W power dissipation is well below that of many other quad-core chips. (See MPR 12/6/10, “Marvell Lands a Quad.”)

Integrate Or Else!

The blossoming of FPGA-CPU and DSP-CPU small-cell hybrids is the latest recurrence of the integration leitmotif that characterizes the semiconductor industry. We expect these new crops of devices to gain quick acceptance among designers. The new FPGA-CPU hybrids rectify the shortcomings of past attempts at achieving this combination, and integrated chips are a perfect fit for low-cost, high-volume small cells.

The FPGA-CPU combination could lure embedded designers with the siren song of creating a custom processor with a single PC and a developer board. More likely, these hybrids will appeal to designers of SoC ASICs that wish to avoid ASIC nonrecurring expenses. Certainly, they will attract OEMs already pairing discrete CPUs and FPGAs owing to the reduction in system cost and power. Xilinx and Altera thus emerge as a threat to established embedded-processor suppliers.

Integration has taken its toll on embedded suppliers in the past. Sometimes, the toll is subtle: embedded processors retain their position in legacy systems while new integrated chips win the latest hot-selling designs. Other times, the toll is clear. DSL modems, for example, once had separate embedded processors and DSPs before enterprising chip companies put the two together to reduce system cost. In this and similar cases, the outcome was that the embedded-processor supplier (typically Freescale/Motorola) was squeezed out as instruction-set stickiness proved far less than was hoped and as licensed CPUs proved flexible and economical.

Embedded-processor suppliers can do little to counter Xilinx and Altera. CPU technology is readily licensed, whereas FPGA technology is closely held. The situation differs with small-cell CPU-DSP hybrids. Having vicariously learned that integration threatens embedded-processor suppliers, Cavium opted to develop its own DSPs and craft a combination chip. This time, Freescale has the luxury of having both key technologies already in its portfolio. By capitalizing on the integration trend instead of waiting to become a victim of it, these companies will stay in the base-station game.

Failure to adequately respond in the past led not just to losing out on designs but also to the emergence of a vibrant CPU-IP business. This situation in turn made ARM the most popular (in unit shipments) instruction set, thereby enticing at least one supplier to all but dump its incumbent architecture for one with an apparently brighter future. ♦

Major Embedded-Processor Events of 2011

In the small-cell arena, Freescale (see MPR 2/21/11), Texas Instruments (see MPR 6/27/11), and Cavium (see MPR 10/10/11) disclosed their first small-cell CPU-DSP hybrids, and Qualcomm announced its newest chip for 3G femtocells (see MPR 10/31/11).

Xilinx and Altera had previously revealed that they were integrating ARM CPUs in their FPGAs but only this year disclosed details. Coming to market sooner, Xilinx spoke up first (see MPR 3/7/11), followed by its nemesis Altera (see MPR 10/31/11). The gate-array-based Spear-1300 from STMicroelectronics is similar to these CPU-FPGA hybrids (see MPR 6/13/11).

AppliedMicro integrates similar features in its forthcoming ARM processors compared with its PowerPC-based ones, but the company is also targeting servers (see MPR 11/14/11). One of these features is SlimPro (see MPR 6/20/11).

Notable new 28nm processor families include Freescale’s QorIQ Amp (see MPR 6/27/11) and NetLogic’s XLP II (see MPR 9/12/11 and MPR 10/10/11).

NetLogic’s biggest news was its pending acquisition by Broadcom (see MPR 9/26/11). The company also announced a new variant of its 40nm quad-core XLP316 targeting cellular base stations (see MPR 2/14/11).

AMD exhibited another type of embedded hybrid, the CPU-GPU combination (see MPR 2/7/11).

LSI announced a new member of its Axxia family (see MPR 2/28/11).

Following a rough 2009, the embedded processor market grew 31%. The Linley Group reports share gains and losses (see MPR 7/11/11).

Cortina unveiled a new processor for broadband gateways (see MPR 9/5/11).

Texas Instruments showed off several Sitara processors combining ARM Cortex-A8 CPUs, a GPU, an Ethernet port, and other functions (see MPR 11/14/11).
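As a cross-check on the thread counts and scaling factors quoted for the 28nm processors in the roadmap discussion, the following sketch reruns the arithmetic. All inputs come from the article; the helper names are hypothetical, and the XLP832 values are merely what the article's "2.5x" comparisons imply, not separately sourced specifications.

```python
from fractions import Fraction


def threads(cpus: int, threads_per_cpu: int) -> int:
    """Hardware threads on one chip."""
    return cpus * threads_per_cpu


def gain_per_area(speedup_pct: int, area_pct: int) -> Fraction:
    """Throughput-per-area ratio of a dual-thread core versus a
    single-thread core, given percentage speedup and area penalty."""
    return Fraction(100 + speedup_pct, 100 + area_pct)


# Freescale T4240: 12 dual-thread CPUs arranged in clusters of 4.
T4240_THREADS = threads(12, 2)
T4240_CLUSTERS = 12 // 4

# Freescale's e6500 claim: 70% speedup for a 30% area penalty.
E6500_GAIN = gain_per_area(70, 30)

# NetLogic XLP980: 20 four-way-multithreaded CPUs, up to eight sockets.
XLP980_THREADS = threads(20, 4)
XLP980_SMP_CPUS = 8 * 20

# Values implied for the 40nm XLP832 by the article's "2.5x" comparisons.
XLP832_THREADS = Fraction(XLP980_THREADS) / Fraction(5, 2)
XLP832_GBPS = Fraction(100) / Fraction(5, 2)

if __name__ == "__main__":
    print("T4240 threads:", T4240_THREADS, "in", T4240_CLUSTERS, "clusters")
    print("e6500 throughput per area: %.2fx" % float(E6500_GAIN))
    print("XLP980 threads per chip:", XLP980_THREADS)
    print("XLP980 CPUs across eight sockets:", XLP980_SMP_CPUS)
    print("Implied XLP832 threads:", int(XLP832_THREADS))
    print("Implied XLP832 packet throughput (Gbps):", int(XLP832_GBPS))
```

Two observations follow from these numbers. The e6500 claim works out to roughly 1.31x the throughput per unit of area, and the 160-way SMP figure equals eight sockets times 20 physical CPUs, which suggests that it counts CPUs and not threads.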

To subscribe to Microprocessor Report, access www.MPRonline.com or phone us at 408-270-3772.
