Advances in CPUs, System Architecture, Heat Up the Performance Race at Hot Chips

Saturday, September 14th, 2013

Celebrating its 25th year, the annual Hot Chips Conference, held at Stanford University last month, lived up to its reputation for highlighting high-performance processors as well as advances in low-power designs for mobile applications. One of the highest-performance processors unveiled at the conference was the Power8, the next-generation Power series processor developed by IBM. The chip is fabricated in a 22 nm silicon-on-insulator process that allows designers to pack in 12 CPU cores, a 512-kbyte, 8-way level-2 cache per core, a large embedded-DRAM shared level-3 cache (96 Mbytes, organized as 12 banks of 8 Mbytes, each 8-way), PCIe Gen 3 ports capable of 8 Gbits/s transfers, eight DRAM ports, and many other sub-processors that offload the CPU and manage power (Figure 1). All these functions add up to about 5 billion transistors, squeezed into a 650 mm² chip with 15 layers of metal that can be clocked at 4 GHz and consumes over 300 W.


Figure 1: Containing about 5 billion transistors, the Power8 CPU developed by IBM combines 12 processor cores, 96 Mbytes of embedded DRAM for a level-3 on-chip cache, dual gen 3 PCIe ports, and eight memory ports. A separate memory buffer chip that connects to each memory port handles four channels of DDR3 memory.


The high throughput of the multicore processor posed another challenge: moving data to and from memory at a rate that won’t starve the processor cores. To solve that problem, IBM designers crafted a companion memory buffer chip, dubbed Centaur, that packs 16 Mbytes of cache and provides a 9.6 Gbyte/s high-speed interface to the processor. The Power8 processor can support up to eight of the Centaur memory buffers to achieve a sustained transfer rate of 230 Gbytes/s between the buffers and the processor. Each buffer chip has four DDR3 memory channels, which together yield a peak throughput of 410 Gbytes/s when all 32 memory channels are transferring data. A fully configured processor socket can address up to 1 Tbyte.
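The socket-level memory figures above can be broken down per buffer and per channel. The following is a minimal sketch that uses only the numbers quoted in this article; the function and constant names are illustrative, and an even split across buffers and channels is an assumption.

```python
# Figures quoted in the article for a fully configured Power8 socket.
BUFFERS_PER_SOCKET = 8
DDR3_CHANNELS_PER_BUFFER = 4
SUSTAINED_GBYTES_PER_S = 230.0   # buffers <-> processor, sustained
PEAK_GBYTES_PER_S = 410.0        # all DDR3 channels active, peak


def power8_memory_breakdown():
    """Split the socket-level figures into per-buffer and per-channel shares.

    Assumes the load is spread evenly, which the article does not state.
    """
    channels = BUFFERS_PER_SOCKET * DDR3_CHANNELS_PER_BUFFER
    return {
        "channels": channels,
        "sustained_per_buffer": SUSTAINED_GBYTES_PER_S / BUFFERS_PER_SOCKET,
        "peak_per_channel": PEAK_GBYTES_PER_S / channels,
    }


if __name__ == "__main__":
    for name, value in power8_memory_breakdown().items():
        print(f"{name}: {value}")
```

Under these assumptions each buffer carries a little under 29 Gbytes/s sustained, and each DDR3 channel peaks at about 12.8 Gbytes/s. That is close to the theoretical peak of a 64-bit channel running at 1600 Mtransfers/s, although the memory speed is not stated here.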

Other high-performance CPUs at the conference, both based on the SPARC architecture, included the SPARC 64X+, a next-generation processor for UNIX servers developed by Fujitsu, and the M6 SPARC processor for enterprise systems developed by Oracle. A companion ASIC to the M6, the Sixby, is a scalability and coherency directory chip that supports the company’s highly scalable enterprise systems.

Fujitsu’s design packs 16 processor cores, each capable of running two threads, along with a shared 24-Mbyte L2 cache, dual DDR3 memory interface controllers, several hardware-based software accelerators for cipher, database, and decimal calculations, dual PCIe gen 3 I/O controllers, and many other enhancements. The SPARC 64X+ processor is fabricated in a 28 nm CMOS process, contains almost 3 billion transistors, has 1500 signal pins, and can clock at over 3.5 GHz. A single processor chip can deliver a peak performance of over 448 GFLOPs and has a memory throughput of 102 Gbytes/s. Designed for systems with anywhere from 1 to 64 CPU sockets, the processors incorporate a high-speed interconnect that can transfer data directly between CPU sockets at up to 25 Gbits/s per lane (about 70% faster than the company’s previous SPARC 64X CPU). A basic system building block comprises four CPU sockets and two crossbar chips, and a full 64-socket system would contain 16 building-block modules capable of executing 2048 program threads.
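The system-level numbers in this description follow from simple multiplication, and the peak GFLOPs figure implies a per-core rate. Here is a small sketch of that arithmetic using the figures quoted above; the function names are hypothetical, and because the article gives the clock and peak performance as "over" 3.5 GHz and 448 GFLOPs, the per-cycle result is only an estimate.

```python
def fujitsu_system(sockets=64, cores_per_socket=16,
                   threads_per_core=2, sockets_per_block=4):
    """Return building-block count and total hardware threads for a system."""
    blocks, leftover = divmod(sockets, sockets_per_block)
    if leftover:
        raise ValueError("socket count must be a multiple of the block size")
    return {
        "blocks": blocks,
        "threads": sockets * cores_per_socket * threads_per_core,
    }


def flops_per_core_per_cycle(peak_gflops=448.0, cores=16, clock_ghz=3.5):
    """Estimate floating-point operations per core per clock cycle."""
    return peak_gflops / cores / clock_ghz


if __name__ == "__main__":
    print(fujitsu_system())
    print(flops_per_core_per_cycle())
```

With the defaults, the sketch reproduces the 16 building blocks and 2048 threads cited for a full 64-socket system, and suggests roughly eight floating-point operations per core per cycle.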

Similar in some features to the Fujitsu processor, the Oracle M6 is also fabricated in a 28 nm process, packs 12 processor cores, and integrates two 8-lane PCIe gen 3 ports. However, the M6 can execute eight program threads per core vs. the two threads per core of the SPARC 64X+, and incorporates 48 Mbytes of L3 cache, whereas the Fujitsu processor has no L3 cache. Four DDR3 memory schedulers, each capable of handling four memory channels, provide a total of 16 DDR channels that can address a total of 1 Tbyte per CPU socket (Figure 2).

Figure 2: The Oracle M6 processor packs 12 processor cores, each capable of executing eight program threads, 48 Mbytes of L3 cache, a pair of 8-lane PCIe gen 3 ports, and four DDR3 memory schedulers that together can address a total of 1 Tbyte per CPU socket.
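The per-socket totals for the M6 can be checked with a few lines of arithmetic. This is a sketch based on the figures quoted above; the function name is illustrative, and treating 1 Tbyte as 1024 Gbytes spread evenly across the channels is an assumption.

```python
def m6_socket_summary(cores=12, threads_per_core=8,
                      schedulers=4, channels_per_scheduler=4,
                      addressable_gbytes=1024):
    """Summarize thread and memory-channel totals for one M6 socket.

    addressable_gbytes assumes 1 Tbyte = 1024 Gbytes, divided evenly
    among channels; the article does not say how memory is distributed.
    """
    channels = schedulers * channels_per_scheduler
    return {
        "threads": cores * threads_per_core,
        "ddr_channels": channels,
        "gbytes_per_channel": addressable_gbytes / channels,
    }


if __name__ == "__main__":
    print(m6_socket_summary())
```

Under those assumptions a socket runs 96 threads, three times the 32 threads of a 16-core, two-thread SPARC 64X+, and each of the 16 DDR channels would address 64 Gbytes.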

In the large systems targeted by the Power8 and the SPARC processors, the CPUs are only a small part of the overall system. At the other extreme, however, in handheld, laptop, and desktop computers, a system-on-a-chip solution is typically the main component. One example, the Kabini processor developed by AMD, packs four Jaguar CPU cores, a high-performance graphics and multimedia engine with DisplayPort, HDMI, and VGA outputs, SATA, USB, and PCIe interfaces, advanced power management, and still other functions (Figure 3). Implemented in a 28 nm process, the chip is only 105 mm², about one-sixth the area and transistor count of the IBM Power8, and consumes close to 1/30th the power. The four processor cores share a 2-Mbyte level-2 cache and are fed through a 64-bit DDR3 memory interface that can transfer data at up to 10.3 Gbytes/s with DDR3-1600 memory DIMMs. Additionally, the graphics/multimedia core is based on the company’s Radeon HD8000 Graphics Core Next (GCN) architecture and can handle 4k by 2k resolution. The core includes a video codec engine that can encode H.264 streams, and a universal video decoder that handles over half a dozen codec formats.

Figure 3: A highly-integrated system-on-a-chip solution for notebook and other portable computing systems, the Kabini processor packs four Jaguar CPU cores that share a 2 Mbyte level-2 cache and a 64-bit DDR3 memory interface. Included on the chip is a high-performance graphics/multimedia processor based on the HD8000 Radeon graphics core.
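To put the Kabini numbers in context, the area ratio against the Power8 and the theoretical peak of the memory interface can be worked out directly. The sketch below uses the die areas and memory type quoted in this article; the helper names are hypothetical, and the peak-bandwidth formula (bus width times transfer rate) is the standard theoretical limit rather than a figure from the presentation.

```python
def area_ratio(large_mm2=650.0, small_mm2=105.0):
    """How many times larger the Power8 die is than the Kabini die."""
    return large_mm2 / small_mm2


def ddr_peak_gbytes_per_s(bus_bits=64, mega_transfers_per_s=1600):
    """Theoretical peak bandwidth of a DDR interface, in Gbytes/s."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * mega_transfers_per_s / 1000


if __name__ == "__main__":
    peak = ddr_peak_gbytes_per_s()
    print(f"Power8 / Kabini area: {area_ratio():.2f}x")
    print(f"DDR3-1600 64-bit peak: {peak:.1f} Gbytes/s")
    print(f"Quoted 10.3 Gbytes/s as share of peak: {10.3 / peak:.0%}")
```

The area ratio comes out slightly above six, in line with the "one-sixth" comparison. The quoted 10.3 Gbytes/s is about 80% of the 12.8 Gbytes/s theoretical peak for a 64-bit DDR3-1600 interface, which suggests it may be an achievable rather than a theoretical figure, though the article does not say so.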


Making its move into the low-power handheld systems arena, Intel showed off its Clovertrail+, an SoC targeted at smartphones that is a significant upgrade over Intel’s Medfield-based smartphone solutions. Implemented in a 32 nm high-k metal-gate process, the Clovertrail+ (Atom Z2580) contains dual Atom CPU cores, a dual-core 2D/3D graphics processor, a multi-standard video decoder capable of handling 1080p/60 Hz video, a video encoder able to encode 1080p/30 Hz video, a camera/imaging subsystem based on a programmable very-long-instruction-word SIMD vector processor, a security engine, and many other control, power management, and system interface support functions (Figure 4).

Figure 4: Taking aim at next-generation smartphones, the Clovertrail+ system-on-a-chip platform developed by Intel employs two dual-thread Atom processor cores that can run at up to 2 GHz. The chip contains a dual-core 2D/3D graphics engine as well as dedicated video decode and encode blocks, an image signal processor, a crypto engine, and many other interface and control blocks.


Each Atom core has a 512-kbyte L2 cache, can execute two program threads, and runs at a top speed of 2 GHz. The processor’s memory interface can support up to 2 Gbytes of low-power DDR2 (533 MHz) and address up to 256 Gbytes of storage over an eMMC 4.41 interface. The enhancements in the Clovertrail+ processor yield a doubling of overall performance vs. the Medfield processor and up to a 3X improvement in graphics performance.

Also focusing on portable system solutions, the just-unveiled Richland processor detailed by AMD incorporates the company’s Turbo-Core temperature-smart technology to manage core performance and power consumption (Figure 5). The Turbo-Core technology is designed to exploit temperature margins more effectively by detecting favorable thermal conditions in real time and adjusting operating voltage and frequency accordingly.

Figure 5: The Turbo-Core temperature-smart technology developed by AMD measures core temperature and performs various calculations to determine the optimum operating frequency and voltage for the CPU and graphics cores on the Richland processor chip.
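The general idea of stepping voltage and frequency up when there is thermal headroom, and back down near a limit, can be illustrated with a toy controller. This is not AMD's algorithm, which is not detailed here; the operating points, temperature limit, and headroom threshold below are all hypothetical values chosen only for illustration.

```python
# Hypothetical operating points: (frequency in MHz, voltage in volts),
# ordered from slowest to fastest. Not taken from any AMD documentation.
P_STATES = [(1600, 0.90), (2100, 1.00), (2500, 1.10), (2900, 1.20)]


def next_p_state(index, temp_c, limit_c=90.0, headroom_c=10.0):
    """Pick the operating-point index for the next control interval.

    Steps down one state at or above the (assumed) thermal limit, steps up
    one state when at least headroom_c degrees below it, otherwise holds.
    """
    if temp_c >= limit_c and index > 0:
        return index - 1
    if temp_c <= limit_c - headroom_c and index < len(P_STATES) - 1:
        return index + 1
    return index


if __name__ == "__main__":
    state = 1
    for temp in [70.0, 75.0, 85.0, 92.0, 95.0]:
        state = next_p_state(state, temp)
        freq, volt = P_STATES[state]
        print(f"{temp:5.1f} C -> {freq} MHz at {volt:.2f} V")
```

A real implementation would also account for power limits, workload, and the separate CPU and graphics domains mentioned in the caption; the sketch only shows the temperature-driven step.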


The processor contains two dual-core CPU modules, each based on AMD’s recently released Piledriver CPU core. Each module has a 2-Mbyte L2 cache that is shared by the two CPU cores in the module. Like the Kabini processor, Richland also incorporates an HD8000 series graphics processing unit and multimedia accelerators to offload the CPUs. The chip delivers up to 29% higher CPU performance and 41% higher GPU performance than the company’s previous-generation solution while keeping power consumption down, allowing systems to deliver 10 or more hours of idle operation, or over five hours of video playback. The processor chip also supports AMD’s wireless display for Windows 8.1, a low-latency wireless interface that can stream HD video at 1080p/60 Hz along with rich audio playback.

With low power consumption on nearly every processor designer’s mind, a new body-bias technology used in conjunction with deeply-depleted-channel (DDC) transistors developed by SuVolta promises to reduce processor power consumption by as much as 50%. Implemented in an ARM Cortex M0 processor to validate the concepts, the body-bias network requires minimal routing resources, and DDC transistor performance can be tuned either to reduce leakage current for high-performance designs or to reduce active power for low-voltage threshold devices, letting processor designers trade off performance against power. The M0 processor implemented with 65 nm design rules achieved comparable performance at half the power vs. a standard 65 nm process, or delivered 35% better performance at comparable power levels.
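The two SuVolta results can be compared on a common performance-per-watt basis. The following is a minimal sketch using only the ratios quoted above; the function name is illustrative, and treating "comparable" as exactly equal is a simplifying assumption.

```python
def perf_per_watt_gain(perf_ratio, power_ratio):
    """Relative performance-per-watt vs. a baseline process.

    Both arguments are ratios against the baseline (1.0 means unchanged).
    """
    if power_ratio <= 0:
        raise ValueError("power_ratio must be positive")
    return perf_ratio / power_ratio


if __name__ == "__main__":
    # Same performance at half the power.
    print(perf_per_watt_gain(perf_ratio=1.0, power_ratio=0.5))
    # 35% more performance at the same power.
    print(perf_per_watt_gain(perf_ratio=1.35, power_ratio=1.0))
```

On that basis the low-power operating point gives the larger efficiency gain (about 2x), while the high-performance point gives about 1.35x.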

These are only a few of the many presentations at the Hot Chips conference. To view the full conference program, go to

Dave Bursky
Semiconductor Technology Editor
Chip Design Magazine